Numerical aerodynamic simulation facility feasibility study by unknown
UNCLASSIFIED
FINAL REPORT
NUMERICAL AERODYNAMIC SIMULATION FACILITY
FEASIBILITY STUDY
Distribution of this report is provided in the interest of information
exchange. Responsibility for the contents resides in the author or
organization that prepared it.
Prepared under Contract No. NAS2-9897 by
Burroughs Corporation
Paoli, Pa.
for
AMES RESEARCH CENTER
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
UNCLASSIFIED
https://ntrs.nasa.gov/search.jsp?R=19790017901 2020-03-21T22:02:02+00:00Z
ERRATA
Chapter 4
pages 4-43, 4-44, 4-45; "I,OCTRH" to read as "LOCATION"
page 4-56, after last line should be appended:
"2. Handling of the special constructs used to change
terminal characteristics and the system's responses
to the terminals."
Chapter 5
page 5-12, under Physical; should read as "Size: 1.2" x
11.5" x 27.5"
page 5-38, under Storage Capacities; should read as "65,536
words/module"
page 5-53, under #4, line 3 to read as "000000011"
page 5-54 bottom of page. The modified equation is
Port = 32 x (EMno MOD 512) + 2 x (EMno MOD 4) + 1
for 512 < EMno < 527
which allows better distribution of the spare
modules.
Chapter 7
page 7-7, paragraph 4, line 7; "CU" should read as "CR"
page 7-18, paragraph 1; line 5 should read as "would
appreciably improve throughput."
Appendix A
page A-3, equation A.1. f_i and t_i are the number of floating
point operations and the execution time of the ith
piece, respectively.
page A-23, third bullet, line 1; "The correct algorithm" should
read as "The given algorithm"
page A-35, paragraph 2, line 2; LAX should be deleted.
page A-40, paragraph 2, line 4; "NJ(J)" should read as "NM(J)"
page A-59, paragraph 2, line 8 should read as "the physical
problem needs to be retained"
page A-63, second equation; "TEM=144" should read as "TEM + 144"
ERRATA (Continued)
Appendix B
page B-32, "60%" column, "Double Omega, 512/512" row,
"0.504" should read as "0.0504".
page B-40, paragraph 2, line 2; "of that" should read as
"of requests that".
page B-45, paragraph 4, line 5; "0 i 10" should read as
"0 ≤ i ≤ 10"
Appendix C
page C-8, paragraph 1, line 9; "SOTREM" should read as "STOREM"
page C-24, under LOADEM, line 2; "TN" should read "CN".
page C-27, Under FILLRE; "FILIR" should read "FILLR".
page C-40, line 4; "CTIX_" to read as "CTIXI"
Appendix D
page D-10, paragraph 1, lines 6, 7, 8; "Hence, (1-F1) is the
fraction of failures that cause a transition
directly into the INTERRUPT state," should be
deleted.
page D-12, under TIME BETWEEN FAILURES (PERMANENT), line 3;
"intermittant type device failure" to read as
"permanent type device failure".
Appendix F
page F-39, paragraph 5, line 9, "15,38K" to read as "15.38K".
Appendix H
page H-15, equation H.3 to read as P(A-UPPER=1) = P(INPUT) x
P(0-BIT=1) x P(1-BIT=1)
UNCLASSIFIED
FINAL REPORT
NUMERICAL AERODYNAMIC SIMULATION FACILITY
FEASIBILITY STUDY
March 1979
Distribution of this report is provided in the interest of information
exchange. Responsibility for the contents resides in the author or
organization that prepared it.
Prepared under Contract No. NAS2-9897 by
Burroughs Corporation
Paoli, Pa.
for
AMES RESEARCH CENTER
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
UNCLASSIFIED
INTRODUCTION
This report presents the results of Burroughs Corporation's
efforts on the Feasibility Study for the Numerical Aerodynamic
Simulation Facility (NASF). The study has demonstrated that a
particular form and architecture for the NASF (proposed originally
during the Preliminary Study [1, 2] and improved during the
present study) would meet the established objectives. The
Numerical Aerodynamic Simulation Facility is conceived to be more
than just a very high-speed computing machine. The facility must
also include all that is required to support the users of such a
high-speed capability. The feasibility study required
consideration of all parts of the proposed NASF system. The depth
of study of each part of the system varied depending on the
complexity of that part of the system, on the impact of that part
on the system capabilities, and on whether or not there was
sufficient prior knowledge about how to implement that part of the
system.
The evaluations performed as part of the study focused on three
major issues. First, the ability of the proposed system
architecture to support the anticipated workload was evaluated.
Second, the throughput of the computational engine (the Flow Model
Processor) was studied using real application programs. Third,
the availability, reliability, and maintainability of the system
were modeled. The evaluations were based on the Baseline Systems
of the Preliminary Studies [1, 2] as modified where appropriate
during this study.
The results of these evaluations show that the implementation of
the NASF, in the form considered, would indeed be a feasible
project with an acceptable level of risk. The technology required
(both hardware and software) either already exists or, in the case
of a few parts, is expected to be announced this year.
This report describes many of the details of the system, including
the hardware configuration, user language, software, fault
tolerance, and other aspects of the system on which this
demonstration of feasibility is based. The first chapter
summarizes the study objectives, the evaluations made, and the
results. The NASF system architecture, which is the basis of
discussion throughout the report, is described in Chapter 2. The
system-level loading analysis performed as part of the study is
summarized in Chapter 2, while Chapter 3 reports on the results of
timing actual codes for the configurations assumed. The NASF
Software and Hardware developments are detailed in Chapters 4 and
5. The various models used to evaluate reliability, availability,
maintainability, and trustworthiness, and the results of that
detailed evaluation, are included in Chapter 6. Chapter 7
describes the models which have been used during Flow Model
Processor (FMP) instruction timing simulations. The report
concludes with a chapter which identifies some of the management
and control techniques which could be used to eventually manage a
project of this scope. Even more detail concerning most of the
areas discussed in the report is included in the Appendices. Each
of the chapters includes an introductory section which can be
scanned to gain a general perception of each part of the project
after reading Chapter 1.
CONTENTS
Chapter/Section  Title  Page
Introduction  iii
Contents  v
1  STUDY OBJECTIVES AND RESULTS  1-1
1.1  STUDY OBJECTIVES  1-1
1.2  SYSTEM DESCRIPTION  1-2
1.2.1  Hardware  1-3
1.2.2  Software  1-4
1.2.3  Fault Tolerance  1-5
1.3  NASF EVALUATION  1-5
1.3.1  System Utilization Studies  1-6
1.3.2  Flow Model Processor Throughput Evaluation  1-6
1.3.3  Availability, Reliability, and Maintainability Evaluations  1-8
Program Success Assurance  1-8
CONCLUSION  1-9
2  NASF SYSTEM ARCHITECTURE  2-1
2.1  OPERATIONAL ENVIRONMENT  2-1
2.2  SYSTEM DESCRIPTION  2-2
2.2.1  FMP  2-2
2.2.2  Support Processing System  2-2
2.2.2.1  Support Processor  2-5
2.2.2.2  Peripheral Support System  2-5
2.2.2.3  File System  2-6
2.3  SYSTEM UTILIZATION STUDIES  2-7
3  APPLICATION ANALYSIS  3-1
3.1  INTRODUCTION  3-1
3.2  PRODUCTION APPLICATIONS  3-1
3.2.1  Functional Requirements  3-1
3.2.2  Projected Performance Summary  3-3
3.3  PERFORMANCE PROJECTION BASED ON BENCHMARK PROGRAMS  3-3
3.3.1  Summary  3-3
3.3.2  Method  3-4
3.3.3  Throughput of Aero Flow Codes  3-5
3.3.4  Weather and Climate Codes  3-6
3.3.5  Applications Beyond the Benchmarks  3-7
3.3.5.1  Large Problems  3-
3.3.6  Application Domain  3-11
4  SOFTWARE  4-1
4.1  INTRODUCTION  4-1
4.2  FMP FORTRAN  4-2
4.2.1  Language Design Considerations  4-2
4.2.1.1  Complexity  4-3
4.2.1.2  Abstraction and Modeling  4-3
4.2.2  Language Constructs  4-5
4.2.2.1  Introductory Example  4-5
4.2.2.2  Geometry  4-8
4.2.2.2.1  DOMAIN Declarations  4-8
4.2.2.2.2  Examples  4-11
4.2.2.2.3  Restrictions  4-12
4.2.2.2.4  Scope  4-12
4.2.2.2.5  Required Order  4-12
4.2.2.2.6  Application and Usage  4-12
4.2.2.2.7  Mapping  4-13
4.2.2.2.8  REGION Statement  4-14
4.2.2.3  Model State  4-15
4.2.2.3.1  INALL Declarations  4-17
4.2.2.3.2  Scope  4-18
4.2.2.3.3  Application and Usage  4-18
4.2.2.3.4  Mapping  4-18
4.2.2.4  Process Modeling  4-18
4.2.2.5  DOALL Construct  4-20
4.2.2.5.1  Construct Definition  4-22
4.2.2.5.2  Serial FORTRAN Equivalent Form  4-25
4.2.2.5.3  Nested DOALLS  4-25
4.2.2.5.4  Mapping  4-25
4.2.2.5.5  Restrictions  4-25
4.2.2.6  Variable Referencing  4-26
4.2.2.6.1  Referencing Within a DOALL-program-segment  4-26
4.2.2.6.2  Centered Subscripts  4-28
4.2.2.6.3  Unreferenced Variables  4-30
4.2.2.7  Storage Allocation  4-31
4.2.2.8  Independent Compilation  4-31
4.2.2.9  Code Generation  4-32
4.2.2.10  Functions  4-33
4.2.2.10.1  Global Functions  4-33
4.2.2.10.2  LOCATION  4-43
4.2.2.10.3  RECURRANCE  4-43
4.2.2.10.4  Efficiency of Global Functions  4-44
4.2.2.10.5  Direct Calls on Global Functions  4-45
4.2.2.11  Assignment Statements  4-46
4.2.2.12  Miscellaneous Features  4-46
4.2.2.12.1  Same-line Comments  4-46
4.2.2.12.2  Recursion  4-47
4.2.2.12.3  DO LOOPS  4-47
4.2.2.12.4  EXIT Statement  4-47
4.2.2.12.5  Dynamic Array Sizes  4-47
4.2.2.13  Input Output  4-47
4.2.3  FMP FORTRAN Compiler  4-48
4.2.3.1  Functional Objectives  4-48
4.2.3.1.1  Support to the User  4-48
4.2.3.1.2  Support of the Language  4-48
4.2.3.1.3  Make Efficient Use of FMP Resources  4-48
4.2.3.1.4  Support the Operational Environment  4-49
4.2.3.2  Functional Organization  4-50
4.2.3.3  Domains  4-50
4.2.3.4  Data Flow Analysis  4-50
4.2.3.5  Memory Allocation  4-50
4.2.3.6  Subroutine Entry and Return  4-52
4.2.3.7  Concurrency  4-52
4.2.3.8  Duplexed Computation Model  4-52
4.3  OPERATING SYSTEM  4-53
4.3.1  Assumptions  4-54
4.3.1.1  Computational Envelope  4-54
4.3.2  B7800 MCP  4-54
4.3.2.1  Interrupt Handling  4-55
4.3.2.2  Memory Management  4-55
4.3.2.3  MCP I/O Handling  4-55
4.3.2.4  Process Control  4-56
4.3.2.5  Peripheral Control  4-56
4.3.2.6  Work Flow Management  4-56
4.3.2.7  Data Communications  4-56
4.3.3  Integration of FMP Task Management into MCP  4-57
4.3.3.1  Limitations  4-57
4.3.3.2  Interrupt Handling  4-57
4.3.3.3  Memory Management  4-58
4.3.3.4  Process Control  4-58
4.3.3.5  Work Flow Management  4-58
4.3.3.6  Utilities  4-58
4.3.4  FMP Portion of MCP  4-58
4.3.5  File Management  4-59
4.3.5.1  FMP Interaction with File Subsystem  4-60
4.3.6  Job Structure  4-61
4.3.6.1  Organization of a Job  4-61
4.3.6.2  Flow of Job  4-62
4.3.7  Program Load and Overlay Support  4-63
4.3.8  Operations Support  4-63
4.3.8.1  Performance Monitoring  4-63
4.3.8.2  System Initialization  4-64
4.4  OTHER SOFTWARE REQUIREMENTS  4-65
4.5  CONCLUSIONS  4-66
5  FLOW MODEL PROCESSOR (FMP) HARDWARE  5-1
5.1  INTRODUCTION  5-1
5.1.1  Design Constraints and Considerations  5-2
5.1.1.1  Throughput  5-2
5.1.1.2  Economy  5-2
5.1.1.3  Hardware/Software Compatibility  5-2
5.1.1.4  Schedule  5-3
5.2  FMP ARCHITECTURE  5-3
5.2.1  General Flow through FMP  5-5
5.2.2  Changes from Baseline System  5-6
5.2.3  Basic System Parameters  5-6
5.2.3.1  Logic Family  5-6
5.2.3.2  Clock Rate  5-7
5.2.3.3  Cabling Methods  5-7
5.2.3.4  Power  5-7
5.2.3.5  Number of Processors  5-7
5.2.4  Modularity  5-8
5.2.5  Preview of FMP Component Descriptions  5-8
5.3  PROCESSOR  5-9
5.3.1  Execution Unit (EU)  5-9
5.3.2  Processor Memory (PM)  5-17
5.3.3  Connection Network Buffer (CN Buffer)  5-17
5.3.4  Design Rationale and Changes from Preliminary Study  5-23
5.4  COORDINATOR (CR)  5-23
5.4.1  Execution Logic  5-29
5.4.2  Coordinator Memory  5-29
5.4.3  Design Rationale and Changes from Preliminary Study  5-29
5.5  PROCESSOR-COORDINATOR INTERACTION  5-31
5.5.1  Instruction Streams  5-31
5.5.2  Synchronization  5-31
5.5.3  Interface  5-31
5.5.4  Fan-Out Tree (Coordinator-to-Processors)  5-33
5.6  EXTENDED MEMORY MODULE  5-33
5.6.1  Basic Characteristics  5-36
5.6.2  Connection Network (CN) Interface from Processors  5-36
5.6.3  DBM Interface  5-39
5.6.4  EM Fanout  5-39
5.6.5  Design Rationale  5-40
5.7  CONNECTION NETWORK (PROCESSORS TO EXTENDED MEMORY)  5-40
5.7.1  Functional Description  5-43
5.7.2  CN Complexity Considerations  5-46
5.7.3  Processor and EM Connection Mapping  5-47
5.7.4  Hardware Aspects  5-50
5.7.4.1  Clocks and Synchronization  5-50
5.7.4.2  Switch Element  5-50
5.7.4.3  Packaging  5-55
5.7.5  Design Rationale and Changes from Preliminary Study  5-55
5.8  DATA BASE MEMORY (DBM)  5-57
5.8.1  General Storage Characteristics  5-57
5.8.2  Soft Error Control  5-59
5.8.3  Design Rationale and Changes from Preliminary Study  5-59
5.9  DATA BASE MEMORY (DBM) CONTROLLER  5-62
5.10  DIAGNOSTIC CONTROLLER (DC)  5-64
5.11  POWER CONSIDERATIONS  5-65
5.11.1  AC Modules  5-65
5.11.2  Other Power Supplies  5-67
5.11.3  Grounding Considerations  5-68
5.12  CIRCUIT AND PACKAGING TECHNOLOGY  5-69
5.12.1  Implementation Technology Update  5-69
5.12.1.1  Summary  5-69
5.12.1.2  ECL Arrays  5-70
5.12.1.3  BCML  5-71
5.12.2  Packaging  5-71
5.12.2.1  General  5-71
5.12.2.2  Printed Circuit Assemblies  5-71
5.12.2.3  Interconnections  5-75
5.12.2.4  Backplanes  5-77
5.12.2.5  Cabinet Frame Assembly and Doors  5-77
5.12.2.6  BCML Packaging  5-78
5.12.2.6.1  General  5-78
5.12.2.6.2  Circuit Packaging  5-78
5.12.2.6.3  Frame, Cooling & Power  5-79
5.13  IMPLEMENTATION TOOLS  5-81
6  TRUSTWORTHINESS AND AVAILABILITY  6-1
6.1  TRUSTWORTHINESS, AVAILABILITY, AND ERROR CONTROL  6-1
6.1.1  General Requirements  6-1
6.1.2  Design Requirements  6-2
6.1.3  Sparing and Duplex Processing  6-2
6.1.4  Error Correction in Memories  6-3
6.1.4.1  SECDED  6-4
6.1.4.2  Scrubbing Errors Out of CCD Memory and Dynamic RAM  6-8
6.1.5  Error Detection and Correction in the Connection Network  6-10
6.1.5.1  Magnitude of the Problem  6-11
6.1.5.2  Defense vs. Type of Fault  6-11
6.1.5.2.1  Single transient error in the request sent to EM  6-11
6.1.5.2.2  Single transient error in data  6-11
6.1.5.2.3  Hard failure on the path from one processor to EM  6-11
6.1.5.2.4  Hard failure on the data path from EM to processor  6-11
6.1.5.2.5  Hard failure in the path-selecting control logic  6-11
6.1.5.3  Analysis  6-12
6.1.6  Logical Checks  6-14
6.1.7  Restart  6-14
6.1.8  Error Logging  6-16
6.1.9  Invariants  6-17
6.1.10  Diagnostics  6-17
6.1.10.1  Level of Performance  6-20
6.1.10.2  NASF Computer Assisted Diagnostic Tools  6-20
6.1.10.2.1  Support Processor System Diagnostics  6-20
6.1.10.2.2  Support Processor Peripheral Exercises  6-21
6.1.10.2.3  FMP On-line Diagnostics  6-21
6.1.10.2.4  Off-Line LRU Diagnostics  6-21
6.1.10.2.5  PAL (Programming Aid for Logicians)  6-21
6.1.10.2.6  Analysis of Logged Errors  6-21
6.1.10.3  CN Diagnostics  6-22
6.1.10.3.1  Assumptions and Design Requirements  6-22
6.1.10.3.2  Localizing a Hard Error in the CN  6-22
6.1.10.3.3  Diagnostic Generation Scheduling  6-24
6.2  RELIABILITY, AVAILABILITY AND MAINTAINABILITY  6-25
6.2.1  Reliability/Availability Model  6-26
6.2.2  Redundancy Study  6-27
6.2.3  Component Quality Study  6-30
6.2.4  FMP Reliability and Availability Prediction  6-30
6.2.4.1  LSI Memory Failure Rates  6-31
6.2.4.2  SECDED Improvement Factor  6-33
6.2.4.3  Ratio of Permanent Failures to Intermittent Failures  6-33
6.2.4.4  Recovery Efficiency  6-34
6.2.4.5  FMP Reliability Analysis Results  6-34
6.2.5  Support Processor and File Management Subsystems  6-38
6.2.6  Maintenance  6-45
6.2.6.1  Maintenance Philosophy  6-45
6.2.6.2  Maintenance Plan  6-46
6.2.6.3  Personnel Support Requirements  6-47
6.2.6.4  Sparing Considerations  6-52
7  FMP TIMING SIMULATIONS  7-1
7.1  FMP MODEL  7-1
7.1.1  Processor Model  7-1
7.1.2  Program Fetch  7-4
7.1.3  Instruction Execution  7-5
7.1.4  Synchronizing Action  7-7
7.1.5  External Access Model  7-8
7.1.6  Branching  7-8
7.1.7  Coordinator  7-9
7.1.8  Extended Memory Access  7-9
7.1.9  Simulation Results  7-10
7.2  SIMULATIONS PERFORMED  7-11
7.2.1  Selected Codes  7-12
7.2.2  TURBDA  7-12
7.2.3  AMATRX  7-13
7.2.4  BTRI  7-13
7.2.5  GISS Climate Code Samples  7-14
7.2.5.1  AVRX  7-14
7.2.5.2  COMP2  7-14
7.2.5.3  COMP3  7-14
7.2.6  3-D Explicit Aero Flow Code  7-16
7.2.6.1  CHARAC  7-18
7.2.6.2  LX/FX  7-18
7.2.6.3  SORT  7-19
7.3  APPLICATIONS OF SIMULATOR RESULTS  7-20
8  SCHEDULE AND FACILITIES  8-1
8.1  SCHEDULE  8-1
8.1.1  Introduction  8-1
8.1.2  The Overall NASF Program  8-1
8.1.3  Schedule Management  8-3
8.1.4  NASF Schedules  8-4
8.1.5  Critical Path Schedule  8-9
8.2  FACILITIES  8-15
REFERENCES  R-1
A  PERFORMANCE PROJECTION BASED ON BENCHMARK PROGRAMS  A-1
A.1  INTRODUCTION  A-1
A.2  METHOD  A-2
A.3  THROUGHPUT OF IMPLICIT AERO FLOW CODE  A-6
A.3.1  Summary  A-6
A.3.2  Assumptions  A-6
A.3.3  Analysis of Implicit Aero Flow Code  A-7
A.3.4  FMP FORTRAN Version  A-9
A.3.4.1  One-to-One Mapping from Serial FORTRAN  A-9
A.3.4.2  SMOOTH  A-9
A.3.4.3  BTRI  A-12
A.3.5  Analysis  A-16
A.3.5.1  Description of Table A.1  A-18
A.4  THROUGHPUT OF EXPLICIT AERO FLOW CODE  A-23
A.4.1  Summary  A-23
A.4.1.1  Results  A-23
A.4.1.2  Observations  A-23
A.4.2  Assumptions  A-23
A.4.3  Method of Analysis  A-24
A.4.4  Simulation and Hand Compiling  A-35
A.5  GISS CLIMATE PERFORMANCE EVALUATION  A-38
A.5.1  Summary  A-38
A.5.2  Discussion of the Analysis  A-38
A.5.3  FMP FORTRAN Version  A-42
A.5.4  Results  A-52
A.6  SPECTRAL WEATHER  A-52
A.6.1  Summary  A-52
A.6.2  FMP FORTRAN Version of FFT Portion of Program  A-55
A.7  OTHER ANALYSIS  A-60
A.7.1  Fast Fourier Transforms on the FMP  A-60
A.7.1.1  Discussion  A-60
A.7.1.2  Timing Estimates  A-62
A.7.1.3  FMP FORTRAN Version of Glassman's FFT Algorithm  A-63
A.7.2  A Parallel Sort  A-65
B  FMP CONNECTION NETWORK - ANALYSIS AND EVALUATION  B-1
B.1  SUMMARY  B-1
B.2  BACKGROUND  B-1
B.2.1  Definitions  B-3
B.2.1.1  p-ordered Vector  B-3
B.2.1.2  p-q-ordered Vector  B-3
B.2.1.3  Random Request  B-3
B.2.1.4  B I c)_.'k/l(.]_ '  B-3
B.2.1.5  C_NII I J.ol.  B-4
B.2.1.6  Pileup  B-4
B.2.1.7  ]"lTd]ll( !  B-4
B.3  ADVANTAGES  B-4
B.4  CN DESCRIPTION  B-6
B.4.1  Versions of Networks Considered  B-11
B.4.2  Logic Design  B-14
B.4.3  CN Function Controls  B-18
B.4.3.1  "BI)CS'J'/IIV_;T"  B-18
B.4.3.2  " NLI ] I"  B-18
B.4.3.3  " \_L- d J.)/IL'I)Ul3d "  B-18
B.4.3.4  ]_J.;l,.]ll(]._ t J<] COllll.l{]llil.c.  B-18
B.4.3.5  1.'1:TCIII]H  B-19
B.4.3.6  ltVS'I'  B-19
B.4.3.7  Coordinator Access to EM  B-19
B.4.3.8  CN to Coordinator Status  B-19
B.4.4  Implementation Details  B-19
B.4.4.1  Flip Flops  B-19
B.4.4.2  Wiring  B-20
B.4.4.3  Logic  B-20
B.4.4.4  Parts Count  B-24
B.5  SIMULATION RESULTS  B-25
B.5.1  Summary  B-25
B.5.2  Data  B-27
B.5.3  Discussion of Simulation Experiments  B-31
B.5.4  Test Cases Abstracted from the Aero Flow Codes  B-36
B.6  SELECTION AMONG THE CN ALTERNATIVE APPROACHES  B-36
B.6.1  Discussion of Results  B-40
B.7  ADDITIONAL CONSIDERATIONS  B-41
B.7.1  Modular Partitioning  B-42
B.7.2  Mapping  B-42
B.7.3  Processor to Processor Transfers  B-44
B.7.4  EM Module Conflicts on p-q-ordered Vectors  B-44
B.7.5  Non-Randomness  B-47
B.7.6  Redundancy  B-48
B.8  CONCLUSION  B-50
C  INSTRUCTION SET AND TIMING INFORMATION  C-1
C.1  INTRODUCTION  C-1
C.2  DESCRIPTION OF TABLE C.1  C-1
C.3  MICROPROGRAMMABILITY  C-3
C.4  COORDINATOR OPERATIONS  C-3
C.5  FORMATS  C-3
C.6  ADDRESSING  C-4
C.7  NUMBER OF INSTRUCTIONS  C-4
C.8  INSTRUCTION EXECUTION TIMING  C-4
C.8.1  Instruction Fetch Timing  C-5
C.8.2  Coordinator Timing  C-8
C.8.3  Synchronization  C-8
C.8.4  Exceptional Cases  C-9
C.9  INTERRUPTS  C-9
C.10  SUBROUTINE ENTRY AND RETURN  C-11
C.10.1  Subroutine Entry  C-12
C.10.2  Subroutine Return  C-12
C.10.3  Within the Subroutine  C-16
C.10.4  Addressing  C-16
C.10.5  Named Common Mechanism  C-16
C.10.6  Arithmetic Details  C-17
C.10.7  Other Instructions  C-18
D  RELIABILITY, AVAILABILITY AND MAINTAINABILITY PROGRAMS  D-1
D.1  COMPUTER SYSTEM MODEL  D-1
D.2  MATHEMATICAL MODEL FOR THE DESIGN PROGRAM  D-3
D.2.1  Markov Graphs  D-3
D.2.2  Conventional Failure and Repair Cycle Model  D-5
D.2.3  Design Model for Hardware Elements Operating Under Software Control  D-6
D.3  MATHEMATICAL MODEL FOR THE CONFIGURE PROGRAM  D-9
D.4  DEFINITIONS  D-12
D.4.1  Program Inputs  D-12
D.4.2  Program Outputs  D-14
D.4.2.1  System Availability  D-14
D.4.2.2  System Time Interval Between Failures  D-15
E  FMP RELIABILITY DATA BASE  E-1
F  SYSTEM THROUGHPUT AND UTILIZATION ANALYSIS  F-1
F.1  SUMMARY  F-1
F.2  MODEL AND ASSUMPTIONS USED FOR ANALYSIS  F-1
F.3  ANALYSIS  F-17
F.4  RESULTS  F-35
F.4.1  Processor Loading  F-37
F.4.2  File System Activity  F-39
F.5  FUTURE WORK  F-45
G  FMP FORTRAN EXAMPLES WITH ORIGINAL FORTRAN SOURCE  G-1
H  CONNECTION NETWORK SIMULATION TOOLS  H-1
H.1  SUMMARY  H-1
H.2  CONNECTION NETWORK FUNCTIONAL SIMULATOR  H-1
H.2.1  Model  H-1
H.2.2  Simulator Controls  H-2
H.2.3  Simulator Output  H-4
H.3  CONNECTION NETWORK STOCHASTIC ANALYZER  H-13
H.3.1  Model  H-13
H.3.1.1  Input Probabilities  H-13
H.3.1.2  Probability Computations  H-15
H.3.2  Analyzer Controls  H-16
H.3.3  Analyzer Output  H-16
I  BENES AND OMEGA NETWORKS FOR FLOW MODEL PROCESSING  I-1
I.1  INTRODUCTION  I-1
I.2  ANALYSIS OF TWO-LEVEL OMEGA NETWORK WITH INTER-LAYER CONNECTIONS  I-b
I.3  SKIP DISTANCE ANALYSIS  I-7
I.4  TOWARD A GENERAL ANALYSIS OF TRANSPOSITION NETWORKS  I-10
I.5  PERMUTATION GROUPS AND PARTITION SETS  I-11
I.6  TERM ANALYSIS FOR RANDOM BLOCKING  I-18
I.7  STATE OF THE CONNECTION NETWORK  I-20
J  DESIGN GUIDELINES FOR NASF PROCESSING SYSTEM  J-1
J.1  SCOPE  J-1
J.2  DESIGN CONSTRAINTS  J-1
J.2.1  Environmental  J-1
J.2.1.1  Atmospheric Conditions  J-1
J.2.1.2  Mechanical Stress  J-1
J.2.1.3  Acoustic Noise  J-1
J.2.1.4  Radiation  J-3
J.2.1.5  Static Electricity  J-3
J.2.1.6  Fungus  J-3
J.2.2  Electromagnetic Interference Control  J-3
J.2.3  Acoustic Noise Control  J-3
J.2.4  Input Power  J-6
J.2.5  Design and Construction  J-6
J.2.5.1  Physical Characteristics  J-6
J.2.5.1.1  Cabinets  J-6
J.2.5.1.2  Size and Weight  J-6
J.2.5.1.3  Marking  J-6
J.2.5.1.3.1  Marking of Equipment  J-6
J.2.5.1.3.2  Marking of Controls  J-6
J.2.5.1.3.3  Marking of Subassemblies  J-6
J.2.5.1.4  Accessibility  J-9
J.2.5.1.5  Grounding  J-9
J.2.5.1.6  Mechanical Operation  J-9
J.2.5.1.7  Transportability  J-9
J.2.5.2  Materials, Processes, and Parts  J-9
J.2.5.2.1  Parts Selection  J-9
J.2.5.3  Workmanship  J-9
J.2.6  Product Safety  J-10
J.2.7  Service Life  J-10
ILLUSTRATIONS
Figure  Title  Page
1.1  Total Cost of NASF Usage  1-2
1.2  NASF Organization  1-3
2.1  NASF Operational Environment  2-3
2.2  NASF Design Center Organization  2-4
3.1  A Pencil of Four Blocks Taken from a Grid that Has Been Subdivided into Sixty-four Blocks  3-10
4.1  Relationship of Simulation and Experimentation  4-1
4.2  Example of Source Code Formatting  4-4
4.3  Section of TURBDA  4-7
4.4  FMP FORTRAN Version of Section of TURBDA  4-7
4.5  Required Order of Statements and Comment Lines in a Program Unit  4-12
4.6  Modulo Mapping of Elements of a DOMAIN to Processors  4-13
4.7  Example Regions Selected from Domain LAYER  4-16
4.8  Allocation of Inall-Variable Sets to Processors  4-19
4.9  Section of FMP FORTRAN Version of SMOOTH  4-21
4.10  DOALL Construct Control Flow  4-24
4.11  Variable U_,, Interpretation  4-28
4.12  Functional Organization of FMP FORTRAN Compiler  4-51
4.13  NASF Job Flow Diagram  4-62
5.1  General Organization of FMP  5-4
5.2  FMP Process Block Diagram  5-10
5.3  Processor Detail Block Diagram  5-13
5.4  Instruction Fetching and Overlap Diagram  5-14
5.5  CN Buffer State Diagram  5-22
5.6  Connections to CN in FMP  5-25
5.7  Coordinator Block Diagram  5-26
5.8  Processor Coordinator Fanout Tree Block Diagram  5-34
5.9  EM Module Block Diagram  5-37
5.10  EM Fanout Tree Block Diagram  5-41
5.11  16 x 16 Omega Network  5-44
5.12  Mapping of EM Module Number to CN Output Port Number  5-51
5.13  CN Switch Element  5-52
5.14  Data Base Memory Block Diagram  5-58
5.15  FMP Power System  5-66
5.16  Multilayered Printed Circuit Board for ECL  5-73
5.17  Component Side of Fully Populated Printed Circuit Board Assembly  5-74
5.18  Multilayered Printed Circuit Assembly with Dual In-Line Devices and Sockets  5-75
5.19  Backplane with Subminiature Coaxial Wire  5-76
5.20  Island Assembly Mounted in Module with Belted Cable Interconnections  5-80
5.21  NASF Hardware Design and Implementation Support System  5-83
6.1  Scrubbing versus Read-Time Error Correction  6-9
6.2  FMP Block Diagram with Diagnostic Layers Superimposed  6-19
6.3  General Reliability/Availability Systems Model for NASF  6-26
6.4  Effects of Redundancy of FMP Mean Up Time  6-29
6.5  FMP Reliability/Availability Block Diagram  6-32
6.6  FMP Reliability/Availability Analysis Lower Bound  6-35
6.7  FMP Reliability/Availability Analysis Probable Case  6-36
6.8  FMP Reliability/Availability Analysis Upper Bound  6-37
6.9  Support Processor Subsystem Reliability/Availability Block Diagram  6-40
6.10  File Management Subsystem Reliability/Availability Block Diagram  6-41
6.11  Support Processor Reliability Data  6-42
6.12  Support Processor Subsystem Reliability/Availability Analysis Results  6-43
6.13  Example of Actual Field Data on a Large System Similar to the Support Processor  6-44
7.1  Flow Model Processor Showing Functions Included in Simulation Model  7-2
7.2  Functions Simulated in Processor Model  7-3
7.3  Simple Instruction Timing Diagram, Contrasting Scoreboard and Queueing Implementations of Instruction Overlap  7-6
8.1  NASF Major Activities Schedule  8-2
8.2  FMP Fabrication and Integration Schedule  8-7
8.3  FMP Processor Final Design, Fabrication, and Test  8-8
8.4  PROMIS Output for FMP Fabrication and Integration Schedule  8-10
8.5  PROMIS Output for FMP Processor Final Design, Fabrication and Test  8-11
A.1  Throughput Projection Formula vs. Simulation Results  A-5
A.2  FMP FORTRAN Version of SMOOTH  A-10
A.3  FMP FORTRAN Version of BTRI  A-13
A.4  Breakdown of Implicit Code into Segments of Code and Nodes for Analysis  A-17
A.5  Calling Tree of Explicit Aero Flow Code and Segments for Analysis  A-25
A.6  Summary of Calling Tree for Explicit Code  A-26
A.7  FMP FORTRAN Version of LX  A-28
A.8  FMP FORTRAN Version of FX  A-30
A.9  FMP FORTRAN Version of TURBDA  A-36
A.10  FMP FORTRAN Version of OUTER  A-37
A.11  FMP FORTRAN Version of AVRX  A-44
A.12  FMP FORTRAN Version of LINKHO  A-46
A.13  FMP FORTRAN Version of Part of COMP2  A-49
A.14  Calling Tree of GISS Weather Code  A-51
A.15  FMP FORTRAN Version of GDSPCI and FFTFOR  A-56
A.16  FMP FORTRAN Version of Glassman's FFT Algorithm  A-66
A.17  ALGOL Version of Glassman's FFT Algorithm  A-68
A.18  Example of a Sort Algorithm Using 2^N Processors  A-70
B.1  FMP FORTRAN of Portions of Subroutine CHRVAL of 2D Explicit Code  B-7
B.2  Partial View, Benes Network  B-8
B.3  Benes Network Connecting 8 Processors to 11 EM Modules  B-10
B.4  Full View of Figure B.1, Details Suppressed Benes Network  B-12
B.5  Benes Network with Processor Ports Spread  B-12
B.6  Benes Network, Both Edges with Ports Spread  B-13
B.7  Double Omega Network, Two Layers, Each the Second Half of a Benes  B-13
B.8  Basic Node  B-15
B.9  Resting State of Node  B-16
B.10  Latchup State, One Path, With Data Transferring to EM  B-16
B.11  Latchup State, Both Paths, Both Transferring to EM  B-17
B.12  Latchup State, Both Paths, Crossed Connectivity  B-17
B.13  Timing of Transfers Through the CN  B-21
B.14  Wiring Crossover Map, 16 Processors To/From 17 EM Modules  B-22
B.15  Wiring Crossover Maps, Full Size System  B-22
B.16  Success on First Cycle vs. Hardware Complement  B-28
B.17  Response to Full and Partial Loading, Stochastic Analyzer Data  B-33
B.18  Distribution of Percent Success for p-ordered Vectors, Benes Network  B-34
B.19  Distribution of Percent Success for p-ordered Vectors, Omega Network  B-34
B.20  Response to Partial Loading, Case 1  B-35
B.21  Response to Partial Loading, Case 2  B-37
B.22  Response to Partial Loading, Case 3  B-38
B.23  16-Wide Omega Network  B-43
B.24  Example of Pileup  B-46
C.1  Instruction Fetching Mechanism  C-6
C.2  Subroutine Stack  C-13
C.3  Stack Allocation in the Data Area  C-14
C.4  Organization of Named Common  C-15
D.1  Computer System Availability Block Diagram  D-2
D.2  Markov Graph of the DESIGN Program Model  D-4
D.3  Simplified Markov Graph for Depletion of Redundancy  D-5
D.4  Three-State Model -- CONFIGURE Program  D-11
E.1  FMP Reliability Data - Lower Bound  E-2
E.2  FMP Reliability Data - Probable Case  E-4
E.3  FMP Reliability Data - Upper Bound  E-6
F.1  NASF System Throughput and Analysis Model  F-2
F.2  NASF Resource Relationships  F-18
F.3  NASF Job Load Scenario  F-40
F.4  Support Processor Average Load Per Shift  F-41
F.5  Advanced Support Processor Average Load Per Shift  F-42
G.1  Implicit Aero - Smooth  G-2
G.2  Implicit Aero - BTRI  G-6
G.3  Explicit Aero - OUTER  G-14
G.4  Explicit Aero - TURBDA  G-16
G.5  Explicit Aero - LX  G-18
G.6  GISS Weather - COMP2 Section  G-22
G.7  GISS Weather - LINKHO (part of COMP3)  G-26
H.1  CN Simulator Output, First Example  H-5
H.2  CN Simulator Output, Second Example  H-8
H.3  CN Simulator Output, Third Example  H-10
H.4  Omega Network with 8 Processors and 11 Extended Memory Modules  H-14
H.5  Stochastic Analyzer Sample Output  H-17
I.1A  Benes Network (n=4)  I-2
I.1B  Omega Network (n=4)  I-3
I.2  Storing a p=5 Matrix in a Prime Number of EM Modules  I-4
I.3  A node of level 1  I-6
I.4  Partition Sets for n=4  I-16
I.5  A 4 x 4 Switch  I-22
J.1  Conducted Limits  J-4
J.2  Radiated Limits (30 Meters)  J-5
TABLES
Table  Title  Page
2.1  Support Processor CPU Hours Needed/Hour (Averaged over Day)  2-8
2.2  NASF Data Transfer Requirements  2-9
2.3  NASF File System Control Activity per Day  2-10
4.1  FMP Intrinsic Functions  4-34
5.1  Processor Characteristics  5-11
5.2  Execution Unit (portion of processor) Characteristics  5-15
5.3  Processor Memory (PM) (part of processor) Characteristics  5-18
5.4  CN Buffer (per processor) Characteristics  5-20
5.5  Coordinator Characteristics  5-27
5.6  Coordinator Memory (CM) Characteristics  5-30
5.7  Processor-Coordinator Interface  5-32
5.8  Fanout (Coord-Processor) Characteristics  5-35
5.9  Extended Memory Module (EM Module) Characteristics  5-38
5.10  EM Fanout Characteristics  5-42
5.11  Connection Network (CN) Characteristics  5-48
5.12  Data Base Memory (DBM) Characteristics  5-60
5.13  Features of Burroughs CML Circuit Family  5-72
6.1  Error Correcting Code  6-6
6.2  Single 55-bit Data Word Passing Through CN with Single Hard Fault  6-13
6.3  NASF Availability Analysis  6-25
6.4  Effect of Redundant Elements on FMP Reliability  6-28
6.5  Effects of Component Quality on FMP Reliability  6-30
6.6  FMP Reliability Analysis Summary  6-39
6.7  File Management Subsystem Reliability Data and Analysis Results  6-45
6.8  Number of Repair Actions per Week for NASF System Elements  6-50
6.9  Corrective Maintenance Labor Hour Estimates  6-51
6.10  Estimated NASF Maintenance Labor Requirement  6-52
7.1  FMP Simulation Results  7-15
7.2  Summary of Simulations of EXPLICIT CODE  7-17
8.1  NASF Event Identification Numbers  8-5
8.2  PROMIS Report Terms  8-13
8.3  Summary of NASF Power and Floor Space Requirements  8-15
A.1  Characterization of Implicit Code Sections  A-19
A.2  Throughput Computations for Implicit Code  A-22
A.3  Performance Analysis of Explicit Aero Flow Code  A-32
A.4  Throughput Computations for Explicit Code  A-34
A.5  GISS Weather Model - Benchmark Simulation Results  A-43
A.6  GISS Weather Model - Benchmark Characteristics  A-53
A.7  Performance Analysis of GISS Weather Model  A-54
A.8  Summary of FFT Throughput Estimates  A-64
B.1  Width vs. Layer Number  B-9
B.2  Summary of Simulator and Analyzer Results for t3_  B-26
B.3  Summary of Individual Simulation Runs  B-29
B.4  Stochastic Analyzer Data, Fraction of Requests Blocked  B-32
B.5  Comparison of Connection Network Options  B-39
B.6  Worst "p-q-ordered" Cases  B-45
B.7  Pileup History  B-48
B.8  Investigation of Non-Random Effects  B-49
C.1  Processor Instructions  C-19
C.1, part 2  Processor operations induced by commands issued by the coordinator  C-27
C.1, part 3  Coordinator Instruction Set  C-28
C.2  Processor Instructions  C-33
F.1  NASF Operational Scenario Data  F-5
F.2  Significant Assumptions  F-16
F.3  Support Processor Characterization  F-36
F.4  Daily Average: Support Processing CPU Hours/Hour  F-38
F.5  NASF Data Transfer Requirements (with COM)  F-43
F.6  NASF File System Control Activity per Day  F-44
H.1     CN Functional Simulator Input Commands
I.1     Summary of Node Level Hand Analysis
I.2     Skip Distance Analysis for OMEGA Network
I.3     Skip Distance Analysis for BENES Network
I.4     Simulation of Skip Distances
I.5     First Half of Stored Benes
J.1     Atmospheric Conditions
J.2     Mechanical Stress
J.3     Power Source Description
J.4     Power Source Transients, Recovery and Capability
CHAPTER 1
STUDY OBJECTIVES AND RESULTS
1.1 STUDY OBJECTIVES
The principal objective of the study has been to assess whether a
facility (the NASF) that supports a throughput well in excess of
what would be commercially available could feasibly be implemented.
In particular, the goal is a system on which a time-averaged
Navier-Stokes computation can be performed in 10 minutes or less
(on steady fluid flow problems involving a million grid points).
This throughput goal is not the only requirement. Since the intent
of the facility is to support daily usage by a large user community,
the NASF system availability needs to be better than 90% and the
facility needs to be nominally available for 22 hours a day. So
that the NASF may support long runs, the mean time between
interruptions should be longer than ten hours. In some cases, an
alternate form of the throughput goal can be used: a sustained,
average rate of execution of one billion floating point operations
per second (one gigaflop/sec or 1 GFLOPS) corresponds roughly to
the problem throughput desired on the aerodynamic flow codes.
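The relation between the two forms of the throughput goal can be
checked with simple arithmetic. The sketch below is illustrative
only: it uses the three figures quoted above (1 GFLOPS, 10 minutes,
one million grid points), and the variable and function names are
assumptions introduced here, not terms from the study.

```python
# Back-of-envelope check of the throughput goal stated in the text.
# All three inputs are quoted from the report; the names are hypothetical.
SUSTAINED_RATE_FLOPS = 1.0e9   # one billion floating point operations per second
RUN_TIME_SECONDS = 10 * 60     # the 10-minute goal
GRID_POINTS = 1_000_000        # problem size named in the goal


def operation_budget(rate_flops, seconds):
    """Total floating point operations executed at a sustained rate."""
    return rate_flops * seconds


total_ops = operation_budget(SUSTAINED_RATE_FLOPS, RUN_TIME_SECONDS)
ops_per_grid_point = total_ops / GRID_POINTS

print(f"operations available in 10 minutes: {total_ops:.2e}")
print(f"operations available per grid point: {ops_per_grid_point:,.0f}")
```

Under these figures a 10-minute run at 1 GFLOPS leaves a budget of
about 600,000 floating point operations per grid point, which is the
sense in which the two goals correspond "roughly".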
The starting point of the effort in this study was the baseline
configuration developed during the Preliminary Study under
contract NAS2-9456 [1,2]. The overall goal was to gain an under-
standing of the characteristics, capabilities, and potential of
the facility in order to make a judgment as to its feasibility.
The study required the development of further specifications in
order to consider the responsiveness to the desired application of
the facility and to develop estimates of the schedule, cost, and
risk of such a development.
Both functional and performance (timing) simulators were developed
to be able to estimate (as accurately as possible) performance and
reliability of the system. Although the primary application of
the facility is likely to be aerodynamic flow modeling, the
performance studies included both aerodynamic flow codes and weather
modeling codes. The use of real programs in these application
areas allowed an initial evaluation of the flexibility of the
language constructs proposed. This evaluation was especially
important since the facility needs to be sufficiently flexible
that algorithm development could be supported for fluid dynamics
algorithms as yet not investigated. In addition, the diverse user
needs for input, output, and algorithm investigation must be
supported.
Since the development of the baseline systems considered only aero-
dynamic flow modeling applications, the consideration of weather
modeling codes was especially important. This consideration was
used to evaluate the flexibility of the system as far as its
support of other, related application areas and was used to deter-
mine whether further improvements might be needed to support these
additional applications.
All of the goals could be met by the system described as a
possible NASF configuration. No hardware modifications would be
needed for weather code optimization. Some minor software
extensions were proposed based on the weather code evaluations.
1.2 SYSTEM DESCRIPTION
Before describing the system evaluated during this study, the
importance of considering all aspects of the facility must be
emphasized. During the development of the system, the focus tends
to be on the hardware and system software (such as operating
systems and compilers). As shown in Figure 1.1, such a focus is
limited. If only the system expense is considered, the other
areas important to the successful utilization of the facility may
be slighted. In particular, users themselves face both the
expense of their training in the use of the system and the day to
day expense of developing and using their various application
programs. This usage would include algorithm development,
programming, model description, data reduction, and so on. The users
must be supported by a staff and whatever other support might be
needed to keep the facility operational. Such support might
include operators, power, cooling, training and supplies.
Although the consideration of all these factors complicates the
development of the facility, these factors must be carefully
considered in order to have a facility that would not only be
economical to acquire but also be economical to use. The system
described below did consider these factors.
Figure 1.1 Total Cost of NASF Usage (diagram; legible labels include
USERS WITH PROBLEMS, MANUFACTURER, and USERS WITH ANSWERS)
1.2.1 Hardware
The system originally defined during the Preliminary Studies and
modified during this study is shown conceptually in Figure 1.2.
The Flow Model Processor (FMP), which provides the required
computational power, is a dedicated computing engine with an
architecture based on the special needs of modeling. The Support
Processor, the Peripheral Support System, and the File System
together constitute the Support Processing System. The Support
Processing System interfaces with the users, maintains the data
files, and controls the flow of jobs and data to and from the FMP.
Not shown in the figure are the support elements including
building, power, office space and cooling.
The architecture of the Flow Model Processor is based on the needs
of discrete modeling and simulation. The FMP, which is described
in more detail later, has 512 processors that normally would
execute independently of, and concurrently with, each other. A
coordinator is used to allow the processors to execute in
synchronism. The processors each have memory space for programs
and data. In addition, a large memory (called the Extended Memory)
can be accessed by all processors through a high-speed network
called the Connection Network. The Extended Memory normally would
contain the data common to the processes being independently
evaluated in each of the processors. Finally, a slower staging
memory (called Data Base Memory) would be provided to hold the next
job, the last job, and the current job. The Data Base Memory
buffers programs and data in order to provide a smooth flow of
tasks to and from the FMP. The memory sizes assumed during the
study were based on the aerodynamic flow codes that are expected to
be the primary application on the FMP.
Figure 1.2 NASF Organization
The Support Processing System would consist of three portions; the
Support Processor, the File System, and the Peripheral Support
System. The Support Processor (the host processor) would run the
main portion of the operating system (called the Master Control
Program). A dual-processor B7800 was assumed for evaluation
purposes. Most of the user interaction with the NASF would be
through the Support Processor. The File System includes disk
packs, an archival store, and the manager of the files. Data
paths to and from the files would exist for the FMP, for the
Support Processor, and for user support. The third element
considered as part of the Support Processing System is the
Peripheral Support System. The Peripheral Support System has been
included because the evaluations performed in the study demon-
strated that at least one of the supportive tasks involved such a
level of work that a special processor for that task should be
considered. In particular, the evaluations demonstrated that an
exceptionally heavy load can be expected in supporting Computer Output to
Microfilm (COM). This load may be in excess of 10,000 frames of
graphic information per day. The Peripheral Support System would
include facilities specially designed to support such exceptional
loads in order to improve the load balance across the entire
facility.
1.2.2 Software
Not shown in Figure 1.2 is the software which would be used to
support users and to control the efficient usage of the resources
within the facility. A dialect of FORTRAN, called FMP FORTRAN,
has been proposed which has a few simple extensions to standard
FORTRAN. These extensions provide application-oriented approaches
to using the independent, concurrent mode of operation. In
addition, statements are included which are capable of using a
large number of processors at once on a single computation. Since
the Support Processor would be a commercially available processor,
standard languages such as ALGOL, FORTRAN, and COBOL would be used
for process definition on that processor. The File System would
not be programmed by the users, but would provide high-level file
management and access capabilities.
The NASF operating system (called the Master Control Program, or
MCP) would reside, in part, on all elements of the system. Since
the Master Control Program (MCP) would be based on existing
software, the major portion would reside on the Support Processor.
The portion of the MCP on the FMP would manage the flow of jobs
within the FMP and would be the primary focus of confidence and
diagnostic procedures within the FMP.
1.2.3 Fault Tolerance
Since the FMP will have between 200,000 and 250,000 integrated
circuits, plus other components, both hard failures and transient
failures can be expected. Means for preserving the integrity of
the computation in the face of such failures must be provided.
The level of Large Scale Integration to be used is expected to
bring forth failure modes that have not been important in the
past, such as background radiation which may cause transient
errors in Data Base Memory. Defense against all these possibi-
lities must be included, and has been included in the architecture
described in this report. Where economically feasible, mechanisms
for error correction have been included such as use of single
error correction, double error detection (SECDED) codes in all
memories. To reduce the probability of double errors in those
memories where transient failures may be expected, mechanisms to
"scrub" the memory by rewriting data back into memory with the
errors corrected are provided. For the various types of faults
which can be detected but are not easily corrected, on-line spare
processors and memory modules can be automatically switched in
under control of the MCP to replace failed elements.
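The interplay of SECDED coding and memory scrubbing described above
can be illustrated with a toy example. The sketch below is not the
code used in the NASF memories; it assumes a textbook extended
Hamming (8,4) code protecting a 4-bit value, chosen only because it
is small enough to read. The function names encode, decode, and scrub
are hypothetical.

```python
# Toy SECDED (single error correction, double error detection) code:
# extended Hamming (8,4). Index 0 holds an overall parity bit; indices
# 1, 2, 4 hold Hamming parity bits; indices 3, 5, 6, 7 hold the data.
# This is an illustrative assumption, not the NASF word format.

def encode(nibble):
    """Encode a 4-bit value as a list of 8 bits."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (status, value); status is 'ok', 'corrected' or 'double_error'."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position            # XOR of positions holding a 1
    parity_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        fixed[syndrome] ^= 1                # syndrome 0 means bit 0 itself flipped
        status = "corrected"
    else:
        return "double_error", None         # nonzero syndrome with even parity
    value = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, value


def scrub(memory):
    """Rewrite correctable words in place; return addresses that need a spare."""
    uncorrectable = []
    for address, word in enumerate(memory):
        status, value = decode(word)
        if status == "corrected":
            memory[address] = encode(value)  # write back with the error removed
        elif status == "double_error":
            uncorrectable.append(address)
    return uncorrectable


memory = [encode(value) for value in range(16)]
memory[5][6] ^= 1                            # transient single-bit upset
memory[9][1] ^= 1                            # two upsets in the same word
memory[9][7] ^= 1
failed = scrub(memory)
print("addresses needing replacement:", failed)
```

Scrubbing matters because a single correctable error left in place
can later be joined by a second error in the same word, at which
point the word can only be detected as bad, not repaired. Periodic
rewriting keeps single errors from accumulating into double errors.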
The FMP was not the only element considered when developing the
necessary fault-tolerant aspects of the system. The CPU in the B7800
Support Processor is duplexed, for example, as are the Data
Communications and Input Output Processors. A distributed control
scheme and a multiplicity of disk packs within the File System
serve to keep the system available for useful work without having
each and every one of them available at any given instant. The
automatic recovery procedures in the software not only support the
FMP as mentioned earlier, but exist as a standard part of the MCP
in the Support Processor.
1.3 NASF EVALUATION
Evaluation of the NASF considered many aspects. Three specific
issues received the major attention in terms of analysis
performed. These issues were an evaluation of system-level
capabilities to support the general work load of the facility, an
evaluation of the throughput of the FMP using real programs, and an
analysis of the availability, reliability, and maintainability of
the system. The general approach used for the evaluation and the
results observed are described below for each of these three areas.
As a result of these evaluations and the other work to date, those
areas which contribute to the risks of the program were identified.
These areas, which relate to the assurance of success of the
program, are explained below.
1.3.1 System Utilization Studies
The evaluation of the NASF system organization showed the feasibi-
lity of the system to support the expected workloads. This evalu-
ation was based on a hypothetical, but well thought out, workload
supplied by NASA [4]. System-level models were developed and used
as the basis of the implementation of system analyzer programs.
The models were operationally based so that they may be easily
verified by direct observation of an actual system as development
might progress.
The system-level evaluation included consideration of the
following:
FMP Loading
Support Processor CPU Loading
Average Data Transfer Rates between Files, Users, FMP, and
Support Processor
Expected number of file management actions such as file
creation, deletion, and accessing.
The results of the evaluation show that the dual-processor B7800
assumed could comfortably handle the expected load with the excep-
tion of the COM support activities discussed earlier. More signif-
icantly, if projection is made to equivalent processors which are
likely to be available before the implementation of the facility,
such processors could handle a significant amount of the COM sup-
port load. The average data transfer rates projected by the anal-
ysis are well below the channel capacities planned. Although more
analysis of peak rate requirements has yet to be performed, the
projections to date are consistent with the expected results.
1.3.2 Flow Model Processor Throughput Evaluation
Throughput of the FMP was evaluated by measuring, in simulation
and by analysis, its performance on complete programs supplied by
NASA. The use of entire programs for measuring performance avoids
a common pitfall in predicting the performance of new and advanced
computers, namely the reliance on throughput evaluations which
look only at the "hard" parts of the problems, which also are by
no coincidence the parts of the problem that the advanced computer
is designed to work best on.
The results of the analysis of the two aerodynamic flow codes
(referred to as aero flow codes) show that the goals for
throughput for aero flow applications are met. One aero flow
code, identified as the "3D implicit" code, was projected to
execute in less than five minutes at a throughput rate of 1.01
billion floating point operations per second. The second aero
flow code, identified as the "3D explicit" code, was projected to
execute in less than seven minutes at a throughput rate of 0.89
billion floating point operations per second. Both codes were
evaluated at the nominal size expected to run on the FMP, specif-
ically one million grid points.
The results of the analysis of the weather codes show that the
FMP, as evaluated, is optimized for the weather codes as well.
NASA supplied two weather (or climate) codes. The first was a
version of the Mintz-Arakawa algorithm, as developed by the
Goddard Institute for Space Studies ("GISS"); the second was a
spectral weather code. The same detailed analysis was applied to
the GISS weather code as had been applied to the aerodynamic codes.
Fourteen days of simulated weather, with 20 minute time steps, in
a 2.5° (latitude and longitude) model with a total of 115,334 grid
points, would take 8 minutes to run on the FMP with an effective
throughput rate of 0.53 billion floating point operations per
second. Scrutiny of the second weather code showed that it could
be expected to run with slightly higher throughput than the GISS
weather code, but the detailed analysis was not made.
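The GISS figures quoted above imply a few derived quantities that
help put the run in perspective. The sketch below is a rough
calculation from the rounded numbers in the text (14 days, 20 minute
steps, 115,334 grid points, 8 minutes, 0.53 GFLOPS). The derived
values are therefore approximate, and the variable names are
assumptions introduced here.

```python
# Rough quantities implied by the GISS weather benchmark figures.
# Inputs are the rounded values quoted in the text; names are hypothetical.
SIMULATED_DAYS = 14
TIME_STEP_MINUTES = 20
GRID_POINTS = 115_334
RUN_TIME_SECONDS = 8 * 60
EFFECTIVE_RATE_FLOPS = 0.53e9

simulated_minutes = SIMULATED_DAYS * 24 * 60
time_steps = simulated_minutes // TIME_STEP_MINUTES

# Total work implied by running at the effective rate for the whole run.
total_ops = EFFECTIVE_RATE_FLOPS * RUN_TIME_SECONDS
ops_per_point_per_step = total_ops / (GRID_POINTS * time_steps)

# How much faster than real time the simulated weather advances.
speedup_over_real_time = (simulated_minutes * 60) / RUN_TIME_SECONDS

print(f"time steps:                    {time_steps}")
print(f"total operations (approx.):    {total_ops:.3e}")
print(f"operations per point per step: {ops_per_point_per_step:.0f}")
print(f"speedup over real time:        {speedup_over_real_time:.0f}x")
```

On these rounded inputs the run covers 1,008 time steps and works out
to roughly 2,200 floating point operations per grid point per step.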
The analysis was very thorough. All programs evaluated were
dissected into code segments, each of which was internally
homogeneous. The throughput was estimated for each individual code
segment. From an analysis of how often each code segment was
executed, the individual throughput estimates were combined into
an overall execution time and throughput rate.
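The way per-segment estimates combine into an overall figure can be
sketched as follows. This is a minimal illustration, assuming the
overall rate is total operations divided by total time; the segment
names and all numeric values are hypothetical and are not taken from
the study.

```python
# Combining per-segment throughput estimates into an overall estimate.
# Assumption: overall rate = (sum of operations) / (sum of execution times).
from dataclasses import dataclass


@dataclass
class CodeSegment:
    name: str
    flops_per_execution: float    # floating point operations in one pass
    seconds_per_execution: float  # estimated time for one pass
    executions: int               # how often the segment runs in the program


def combine(segments):
    """Return (total_seconds, overall_flops_per_second) for a whole program."""
    total_flops = sum(s.flops_per_execution * s.executions for s in segments)
    total_seconds = sum(s.seconds_per_execution * s.executions for s in segments)
    return total_seconds, total_flops / total_seconds


# Hypothetical program: a fast inner segment and a slow setup segment.
segments = [
    CodeSegment("inner sweep", 1.5e9, 1.0, 100),   # runs at 1.5 GFLOPS
    CodeSegment("boundary setup", 1.0e8, 1.0, 50), # runs at 0.1 GFLOPS
]

total_seconds, overall_rate = combine(segments)
naive_average = sum(
    s.flops_per_execution / s.seconds_per_execution for s in segments
) / len(segments)

print(f"total time:           {total_seconds:.0f} s")
print(f"overall rate:         {overall_rate / 1e9:.2f} GFLOPS")
print(f"naive average of rates: {naive_average / 1e9:.2f} GFLOPS")
```

The overall rate is weighted by how long each segment actually runs,
so it differs from a simple average of the segment rates. This is why
timing whole programs, rather than only their fastest kernels, gives
a more honest throughput figure.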
As a verification of the hand analysis, sections of code were
input to an instruction timing simulator. The code sections
chosen for simulation verified throughput rates ranging from less
than 0.1 GFLOPS to more than 1.5 GFLOPS. The instruction timing
simulator was based on a reasonably detailed model of a processor
in the FMP. The instruction times assumed in the model correspond
to what could be expected using good engineering practices and a
modern circuit family such as the Fairchild 100K family of ECL
circuits. The times assumed in the model for access to the common
memory via the Connection Network were based on detailed analysis
of the Connection Network itself. A CN simulator was developed
and used to analyze various access patterns including some taken
from the aero flow codes. A stochastic analyzer was used to
determine the probability of success in making connections. The
stochastic analyzer used probability equations for analysis. Both
methods validate a transfer rate through the connection network of
over one billion words per second from all processors to all
memory modules.
The analysis of the various programs required preparation of FMP
FORTRAN versions to be used in the analysis and as the starting
point for hand-compilation onto the instruction timing simulator.
The conversion from the FORTRAN code supplied to FMP FORTRAN was
generally straightforward. In some cases, significant reductions
in the length of the code could be made because of the application-
orientation of FMP FORTRAN.
1.3.3 Availability, Reliability, and Maintainability Evaluations
Several methods were used to evaluate the availability, reliability,
and maintainability of the NASF. The predictions for the FMP
are based upon a computer model of reliability and availability
with assumptions that are derived from the military standard
methods for estimating reliability. In an attempt to be as
realistic as possible, field data which included failures due to
system software as well as hardware was used. In addition,
intermittent failure modes were modeled, where the rate of
intermittents was based on field experience.
With the fault tolerance mechanisms in place, the availability
forecasts are 99% for the FMP by itself and over 99% for the
Support Processing System. These individual predictions combine
to an NASF availability of over 98%. An estimate of 14.1 hours
between interruptions of processing was also made as a result of
the reliability and availability modeling. These predictions for
the SPS are based on field data for the B7700, which is similar to
the B7800 for reliability and availability.
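The way the two subsystem forecasts combine can be illustrated with
the simplest series model, in which the NASF is up only when both the
FMP and the Support Processing System are up and their outages are
independent. That model is an assumption made here for illustration;
the study's own availability model may be more detailed. The names
below are hypothetical.

```python
# Series availability sketch: the facility is available only when every
# element is available. Independence of outages is assumed.

def series_availability(*availabilities):
    """Product of element availabilities for elements that must all be up."""
    result = 1.0
    for value in availabilities:
        result *= value
    return result


FMP_AVAILABILITY = 0.99   # forecast quoted in the text
SPS_AVAILABILITY = 0.99   # text says "over 99%"; 0.99 is used as a floor

nasf_availability = series_availability(FMP_AVAILABILITY, SPS_AVAILABILITY)

# Goals stated in Section 1.1 of the report.
AVAILABILITY_GOAL = 0.90
INTERRUPTION_GOAL_HOURS = 10.0
ESTIMATED_HOURS_BETWEEN_INTERRUPTIONS = 14.1   # estimate quoted in the text

print(f"combined availability: {nasf_availability:.4f}")
print("availability goal met:", nasf_availability > AVAILABILITY_GOAL)
print(
    "interruption goal met:",
    ESTIMATED_HOURS_BETWEEN_INTERRUPTIONS > INTERRUPTION_GOAL_HOURS,
)
```

With both elements at 0.99 the product is just above 0.98, which is
consistent with the "over 98%" figure quoted for the NASF as a whole.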
1.3.4 Program Success Assurance
To assure the success of the NASF project, one must assure success
in all areas. Some areas, being dependent mainly on existing
technology or existing methods, were only briefly addressed during
the study. Other areas of concern, especially where the NASF and
its FMP represent a break with past experience, were addressed at
greater length. A discussion of some of the key points addressed
is summarized below.
Although outside the scope of the study, the need for continuing
commitment to the successful implementation of the NASF on both
NASA's and the vendor's parts must be carefully considered. The
close technical interaction that was so important to the Prelimin-
ary and Feasibility Studies must be continued. The length of time
from the eventual start of design to delivery of the system is
long. Project attention must be kept firmly on the job at hand.
Continual changes of direction, dilution of effort, and expansion
of goals could make the project seem to have a constant time-to-
completion. This study has shown that a project begun now, with
currently available or imminently expected technology, could
deliver an operational system which would fulfill NASA's
objectives.
Software development could have several potential problem areas.
Software has been notoriously hard to schedule, often because of
incomplete or changing specifications. Software is especially
subject to the temptation to add "just one more little feature"
making the resulting product more and more complex and difficult
to test. This problem must be handled by careful management. The
two major areas of software concern in the NASF are the operating
system, and the language and compiler. The operating system
(called the Master Control Program, MCP) would be based on the
existing MCP of the B7800 planned as the Support Processor. This
MCP has a history of 19 years of development behind it and is
already being modified by Burroughs to support job flow to the
computational engine for the Burroughs Scientific Processor. With
this work substantially complete, the integration of the FMP
becomes a task with much less risk.
Compiler development is another area often assumed to be a problem
area. Here risk has been significantly reduced by proposing a
language which is essentially ANSI Standard FORTRAN with a
structure surrounding the FORTRAN pieces. This structure allows
the FORTRAN pieces to map directly onto the many individual
processors of the FMP. The result is that most of the compilation
is the same serial FORTRAN to processor-level code process that
industry and Burroughs have considerable experience with. The
coordination between the pieces of standard FORTRAN is simply
described by the added structure and maps easily onto the section
of the FMP specifically designed for such coordination (i. e., the
coordinator).
As a result of the approaches proposed and evaluated during the
study, the success of implementation of the necessary software
seems assured.
Hardware presents no threat to the success of the project. The
technology projections made during the Preliminary Study [2, 3]
are proving to be conservative. Logic design would be straight-
forward and presents little in the way of new challenges. The
organization considered is very modular which would allow
implementation of the system with only a few types of modules.
The one area in the hardware which represents a feature not found
so far in any commercial computer is the Connection Network. This
network provides the necessary data paths between the many
processors and the large, common memory in the FMP. This network
has been thoroughly simulated and otherwise analyzed during the
course of this study.
1.4 CONCLUSION
The work summarized above has demonstrated the feasibility of the
Numerical Aerodynamic Simulation Facility. Although some risks
have been identified, the level of risk is low for the architec-
ture and software considered during the evaluation. This system
is believed to be the best approach to meeting the total system
goals for the NASF. In particular, with these concepts no new
advances, beyond the technology available today, are needed in
order to successfully implement the facility.
CHAPTER 2
NASF SYSTEM ARCHITECTURE
As indicated in Chapter 1, the feasibility study of the NASF
required broad consideration of the total needs of the proposed
facility and of the expected user community. Because of time and
budget constraints, detailed study was based on commercially
available equipment wherever possible. The system architecture used
for evaluation is substantially the same as that described during
the Preliminary Study [1, 2]. However, some changes were
indicated, based on this feasibility study. The modeling which was
done in support of this study was operationally based. That is,
the system-level models are designed so that they may be easily
verified by direct observation of an actual system. This approach
was chosen to make future verification of the models straight-
forward.
2.1 OPERATIONAL ENVIRONMENT
Before considering the system architecture in detail, it is impor-
tant to first consider how the facility is expected to support the
user community. The planned operational environment of the NASF
has been reviewed in two documents provided by NASA [3, 4]. The
central computational facility (which includes the Flow Model
Processor and a Support Processing System) will be accessed by a
number of users at sites remote from the facility. Some of the
"remote" sites would be physically nearby (such as the NASA Ames
facility) while others would be at distant locations.
For the purposes of the study, some assumptions were made concern-
ing the users. The operational environment described by NASA
shows that many of the users will be directly concerned with pro-
duction use of the facility for design work. These production
users have been assumed, for purposes of the study, to be working
in design teams at "design centers". These design centers have
been assumed to have sophisticated graphics, processing, file
storage, and communications capabilities. These design centers
would reduce the processing load of the facility.
The NASA documents also pointed out that other users will be
involved with code development, method development, and research
in fluid physics and other areas. Some of these users have been
assumed to be associated with the design centers, at least as far
as use of facilities is concerned. Other users would have direct
access to the computational facility from their terminals.
Figure 2.1 depicts the assumed operational environment with the
central computational facility of the NASF at the top and with
users having access to that facility either via terminals or via
design centers. Figure 2.2 depicts the organization of a design
center. All sophisticated graphics equipment was assumed to be
associated with design centers. The processors which are part of
each design center were assumed to provide support to the users
both in terms of graphics I/O operations and in terms of text and
file handling. If the "nearby" design centers are assumed to
support fourteen active users and if the "remote" centers are
assumed to support four active users, the configuration shown in
Figure 2.1 would have at least 100 active users.
The design centers have not been studied further and are certainly
not a required part of the overall system. The main reason for
their consideration was to develop a realistic estimate of the
amount of load on the Support Processing System for text input and
editing tasks. Based on the environment just described, the
fraction of users who require the Support Processor for data entry
and editing was assumed to be 0.2. The other 80% of the users
either use the facilities of a design center, or have terminals
with built-in edit mode capabilities.
2.2 SYSTEM DESCRIPTION
The NASF consists of three elements; the Flow Model Processor
(FMP), the Support Processing System (SPS) and the physical en-
vironment including the building, power, cooling, etc.
2.2.1 FMP
The Flow Model Processor (FMP) is a dedicated, single-user-at-a-time
computing engine which has no I/O capabilities except through
a staging memory. The FMP is based on a large number of indepen-
dent processors, each executing FORTRAN code independently of the
other. The extensions to FORTRAN described in Chapter 4 include
constructs which allow description of significant amounts of inde-
pendent, concurrent operations. In addition, provision was made
(in both the hardware and software) to allow a single computa-
tion to utilize a large number of processors. The FMP also in-
cludes a very wide bandwidth memory that can be shared by all the
processors. The memory sizes assumed for the study were based on
the aero flow programs used for evaluation during the study. More
details are included in Chapter 5.
2.2.2 Support Processing
The Support Processing System serves as the central control, inter-
faces with users and peripherals, maintains the data files and pro-
vides that computational support necessary to keep the FMP effect-
ively utilized. The Support Processing System consists of three
portions; a host processor called the Support Processor, a File
System, and a Peripheral Support System. Most of the discussion
throughout the rest of this report refers only to the Support
Processor (which is the host) and to the File System.
Figure 2.2 NASF Design Center Organization (diagram; legible labels
include INPUT FROM NASF, TERMINAL, LOCAL FILE STORAGE, PROCESSOR,
and QUEUE)
2.2.2.1 Support Processor
The host processor, which is identified as the Support Processor
during this report, was assumed to be a dual-processor B7800 for
the purposes of the study. This processor was chosen for two
major reasons. First, the B7800 system is a new, standard product
which has evolved from the Burroughs 700 and 800 series machines
over the past 16 years. A wide range of data communications and
peripheral support is available on this system. Second, because
the B7800 is an evolutionary system, it supports the Master
Control Program (operating system), compilers, utilities, and
application programs developed by Burroughs for the B6000 and
B7000 series processors. The feasibility of this system for con-
trol of the FMP seems clear since the same functions are already
being implemented for the Burroughs Scientific Processor (BSP)
which also attaches to the B7800 system.
The B7800 employs independent functional processing to distribute
both intelligence and control among various processing elements.
The B7800 includes five independent functional processors. They
include the central processor, the input output processor, a
memory control processor, a communications processor, and a maint-
enance and diagnostic processor. The configuration assumed for
the study includes redundancy in essentially all elements of the
system, resulting in very high availability.
Since the Support Processor would be the master control for the
facility, most user communications would be supported with the
Data Communications Processor portion of the B7800. The
configuration assumed for the study included 96 input lines, of which four
were synchronous broadband lines (19.2 Kbps - 1,344 Kbps). The
remainder were assumed to be a combination of synchronous and
asynchronous lines of various rates (1.2 Kbps to 9.6 Kbps).
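The line configuration above bounds the aggregate data communications
capacity. The sketch below computes the two extremes, with every line
at the bottom of its quoted range and every line at the top. The
actual mix of line rates is not given in this section, so the result
is only a bracket, and the names used are assumptions.

```python
# Bounds on aggregate line capacity for the assumed 96-line configuration.
# Rate ranges are quoted from the text; the per-line mix is unknown.
TOTAL_LINES = 96
BROADBAND_LINES = 4
BROADBAND_RANGE_KBPS = (19.2, 1344.0)   # synchronous broadband lines
OTHER_RANGE_KBPS = (1.2, 9.6)           # remaining sync and async lines

other_lines = TOTAL_LINES - BROADBAND_LINES


def aggregate_kbps(broadband_rate, other_rate):
    """Total capacity if every line of each class runs at the given rate."""
    return BROADBAND_LINES * broadband_rate + other_lines * other_rate


lowest = aggregate_kbps(BROADBAND_RANGE_KBPS[0], OTHER_RANGE_KBPS[0])
highest = aggregate_kbps(BROADBAND_RANGE_KBPS[1], OTHER_RANGE_KBPS[1])

print(f"lines other than broadband: {other_lines}")
print(f"aggregate capacity, all lines at minimum rate: {lowest:.1f} Kbps")
print(f"aggregate capacity, all lines at maximum rate: {highest:.1f} Kbps")
```

The four broadband lines dominate the upper bound, which suggests
that bulk transfers would depend mainly on those lines while the
remaining 92 lines serve interactive terminal traffic.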
In addition to the standard line control disciplines, the B7800
and its Data Communications Processor would provide the capability
for network access and control. This capability would provide the
needed flexibility for potential users to be connected to the
facility "on their terms".
2.2.2.2 Peripheral Support System
In addition to the input/output processors on the B7800 Support
Processor, the study demonstrated that some peripherals would
require a significant amount of computational support. The most
significant of these devices is the Computer Output to Microfilm
(COM) device. The NASA supplied scenario of usage [4] postulated
a very heavy COM load (in excess of 10,000 frames per day) where
the output was assumed to be graphic images. The majority of this
load was for "movies" of complex evaluation results.
The system utilization analysis (summarized in Section 2.3 below)
clearly demonstrated the impact of the COM formatting, even when
the formatting was only to produce listings of the points of inter-
est rather than graphics control procedures. This load could be
supported with additional central processors within the Support
Processor. Alternatively, the load could be supported by doing
the necessary formatting in the FMP prior to FMP task completion.
A third alternative was to consider a separate system, specifical-
ly oriented to supporting this formatting and the COM device.
No study has been completed with regard to what impact the COM
formatting would have on FMP loading. By studying the Support
Processor loading with and without the COM formatting task, it was
clear that one feasible way to support a load of this sort was
with a specialized system specifically planned for that support.
Although further study should be performed, a Peripheral Support
System configured with two high-end minicomputers with special-
purpose software should be capable of handling the required tasks.
2.2.2.3 File System
A separately managed and accessed file system is required as part
of the Support Processing System. The volume of data and programs
which will be moved in and out of the Flow Model Processor togeth-
er with the amount of file management required for the total
system indicate strongly that a separate system be provided for
this purpose (rather than using the Support Processor itself for
example). Secondly, when file management functions are in a
processor different from any processors which may be executing
user programs, the confidence in security capabilities can
increase significantly. The File System includes the disk packs,
the archival store, and the file manager. Conceptually, the File
System also includes the Data Base Memory, the staging area for
programs and data within the FMP.
The File System is another part of the facility where a detailed
study has not been completed. Enough is known about the require-
ments of the File System to be confident that such a system can be
configured from essentially standard components. These require-
ments will be summarized below.
The File System should be organized such that many simultaneous
high-speed transfers are possible. The NASF architecture requires
four major connections to the File System: the FMP, the Support
Processor, the Peripheral Support System, and the Users. In
addition, the File System would be capable of responding to re-
quests for data movement within the File System and would provide
automatic management of the space.
The FMP requires up to four simultaneous (12.5 Mbits/sec each)
paths to and from the File System, although the use of these paths
is not continuous. The peak requirement of each of the two
Input-Output Processors of the B7800 Support Processor is also
50 Mbits/sec, which, like the FMP connection, is primarily disk I/O.
The Peripheral Support System requirements are insignificant by
comparison. The interface from Users to the File System would be
for the purpose of accessing graphics data files without having to
funnel such requests through the B7800 Support Processor. Again,
since the User loading would be on the order of the Peripheral
Support System loading, the User loading would be supported well
if the FMP and Support Processor are supported. The technology
for connections of this sort has already been reported in the
literature [15] and is in production. Equipment to support 50
Mbits/sec per channel is available now. The major thrust of
development, at least at Burroughs, is to significantly reduce the
cost of this technology or an equivalent.
The file manager would be expected to handle approximately 10,000
file creations and deletions per day. The File System would
respond to approximately 25,000 requests for file accesses per
day. All interfaces to the file system would be in terms of file
"names" rather than physical media position. The File System
would perform dictionary management and storage allocation
functions. Also, the File System would be responsible for data
ownership and access controls.
The analysis assumed a file configuration with both high-speed
storage and lower-speed mass storage on-line. In particular more
than 10^11 bits of high-speed disk storage (25 msec average access,
3.6 MByte/sec transfer rate) was planned. More than 2 x 10^12 bits
of mass storage (3 seconds average access, 1 MByte/sec transfer
rate) was also planned. Although these appear more than adequate,
the utilization studies described below have not yet considered
what file capacities would be required given the scenarios
supplied by NASA late in the period of the study.
2.3 SYSTEM UTILIZATION STUDIES
The ability of the NASF system organization to support the
expected workloads was evaluated. The
evaluation is summarized below and discussed in more detail in
Appendix F. In summary, the system organization described above
would be capable of supporting the workload hypothesized by NASA
[4].
The NASF Utilization document [4] provided by NASA described the
use of the facility in terms of class of usage, called Cases, and
in terms of the sequence of Tasks performed for each job. The
Cases (such as method and code development or design simulations)
and Tasks are summarized in Appendix F. Before confidence could
be gained that the system could support the projected load, the
commitment of each system-level resource to each task was
carefully charted. These event sequence charts identify the sequence
of events needed to implement each task. The charts also identify
those system resources (File System, Data Comm, Support Processor,
FMP, ...) which must interact to implement each event. Samples of
these charts are also included in Appendix F.
The charts were then used as a model to develop a program which
was used to analyze the impact of the hypothetical workload on the
various components of the system. Some of the data used in the
analysis program was based on a benchmark of a mix of FORTRAN
programs on a B7700. A known factor of improvement was used to
project expected B7800 performance. In addition, a processor
which might be available about the time that the NASF project
would be implemented was hypothesized and used for evaluation
purposes.
Table 2.1 summarizes the results of the analysis of the Support
Processor loading with and without the COM formatting discussed
earlier.
TABLE 2.1
Support Processor CPU Hours Needed/Hour
(Averaged over Day)

Processor             With COM    Without COM
Similar to B7700        14.2          1.3
Similar to B7800         9.5           .9
"Future Processor"       2.8           .2
In Table 2.1, note that a support processor implemented with the
future processors expected to be available to the NASF project
could support the COM workload with a reasonable-sized system (3-4
central processors).
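The processor count quoted above can be reproduced by reading each entry of Table 2.1 as a sustained load in CPU-hours per wall-clock hour. The following is a minimal sketch, assuming (this is our assumption, not a statement from the study) that central processors scale linearly and that the smallest workable count is the average load rounded up. The function name min_processors is illustrative only.

```python
import math

# Average Support Processor load from Table 2.1, in CPU-hours needed per hour.
TABLE_2_1 = {
    "Similar to B7700": {"with_com": 14.2, "without_com": 1.3},
    "Similar to B7800": {"with_com": 9.5, "without_com": 0.9},
    "Future Processor": {"with_com": 2.8, "without_com": 0.2},
}


def min_processors(load_cpu_hours_per_hour):
    """Smallest whole number of processors that covers an average load.

    Assumes linear scaling and no headroom for peaks (illustrative assumption).
    """
    return math.ceil(load_cpu_hours_per_hour)


for name, load in TABLE_2_1.items():
    print(name, min_processors(load["with_com"]), min_processors(load["without_com"]))
```

Under these assumptions the "Future Processor" row with COM formatting needs at least three processors, which agrees with the 3-4 central processors cited in the text once some margin for peak load is allowed.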
Table 2.2 summarizes the Data Transfer Requirements averaged over
the day and by shift. Note that these data transfer rates only
show the average rates, not the peak rates needed to prevent the
data path from being a bottleneck. The daily average is over a
full 24 hour day. The data rate (char/sec) assumes 8-bit
characters plus error control.
TABLE 2.2
NASF Data Transfer Requirements
RATE (Char/Sec)

                                   Daily      ------- Hourly Average -------
                                   Average    12M-3am     5am-5pm     5pm-12M
Support Processor - File System     29,240     16,678      83,388      35,937
Support Processor - FMP               .050        .08         .02         .02
Support Processor - Users            4,453      8,125         228         187
File System - Users                 24,260     45,900       3,002       1,554
File System - FMP                  163,400    210,032     294,770      73,770
Table 2.3 summarizes the File System control activity by day. The
terms ACTIVE, LONGTERM, and ARCHIVE in the table indicate the different
types of files expected to be found in the File System.
Active files are those only recently created or actively used and
would be on the devices with the fastest access times. Longterm
files are those which have been in the active system for up to a
week with little or no use before being copied onto a slower
media. Some files are saved on on-line mass storage, called the
Archive in the table. These files would have an access time on
the order of seconds but would still be on-line.
TABLE 2.3
NASF File System Control Activity per Day

                            FILE TYPE
FILE ACTIVITY       ACTIVE    LONGTERM    ARCHIVE
Files Created         2483      1127        627.3
Files Deleted         2483      1127        627.3
Files Accessed       19810       827.7      118.3
Files Replaced        1302
The analysis performed to date is sufficient to give one confi-
dence that the system studied would be capable of supporting the
hypothesized workload. Before design can begin, more detailed
studies should be performed to determine more accurate estimates
of grid generation task requirements, the impact of interactive
graphics support tasks and the sensitivity of system support to
all parts of the hypothesized workload.
CHAPTER 3
APPLICATION ANALYSIS
3.1 INTRODUCTION
The requirement for performance for the FMP was initially stated
as the execution of the "typical" 3D Navier-Stokes aero flow code
on 200 x 50 x 100 grids in 10 minutes, with the provision that the
FMP should be a flexibly programmable machine that can run a
number of similar applications with similar throughput. These
throughput goals can be restated, with respect to the sample aero
flow codes supplied by NASA, in terms of a more hardware-related
secondary standard of performance, that the FMP should be capable
of achieving a sustained rate of 1.0 Gflops/sec on aero flow codes
that take advantage of its architecture. These goals were met, as
described in more detail below.
3.2 PRODUCTION APPLICATIONS
The statement of work specifically asked for a design that is
adapted to the requirements of computational aerodynamic
programming, with a secondary look at the requirements of weather
computations. NASA supplied two examples of aerodynamic flow
codes, identified as the "3D explicit" code and the "3D implicit"
code. In addition, two programs exemplifying the weather applications
were supplied, one being a Goddard Institute of Space
Studies (GISS) version of the Mintz-Arakawa global circulation
model, the other one being a spectral weather code from MIT
(Spectral).
3.2.1 Functional Requirements
The application areas of interest, as exemplified by the codes
supplied, represent a substantially different spectrum of applications
than one would arrive at by questioning all of the users
of very high speed numerical computation.
A general purpose very high speed numerical computing machine must
support a wide variety of precision requirements. For example,
users with sparse and ill-conditioned matrices, such as one finds
in some structures applications, require very high precision, for
some users well over 30 decimal digits. Aero flow and weather
codes apparently will run happily with not more than 10 or 12
decimal places of precision, with much of the computation and most
of the data needing only six or seven places of precision.
It has been appreciated for two decades that the speed of light
puts an upper limit on the throughput of a uniprocessor, and that
very high-speed machines must use some sort of parallelism or
concurrency in order to achieve throughput. Traditionally, paral-
lelism has taken two forms, first, vector machines in which only
data with extreme regularity could be processed in parallel, and
second, multiprocessors in which many totally independent programs
run in parallel. Because aero flow and weather codes can be
vectorized, a vector machine could be made to work. However, the
vectorization imposes inefficiencies (for example, subroutine
CHARAC in the 3D explicit, or COMP3 out of the GISS weather). As
a result a machine that is efficient only for vectors is often not
efficient when considering all of the programs that one expects
that computational fluid dynamicists would want to write.
Hence, part of the problem is to demonstrate the feasibility of a
flow model processor that is as efficient on vectors as the tradi-
tional vector machine, and is also efficient when the concurrent
processing is on data that does not form vectors. Furthermore,
the language should allow for the description of parallel (or
vector) operations and for concurrent scalar processes which are
independent of each other, or for any mixture of the two.
Demonstration of optimum feasibility of the FMP for its applica-
tion set therefore includes:
- Provision of concurrency (or parallelism) for high throughput
without the requirement for vectorization of the algorithm.
Although the implicit algorithm is easily vectorized, and the 3D
explicit is also easily vectorized, the earlier 2D explicit was
not all easily vectorizable, and a large portion of weather (sub-
routine COMP3) can be vectorized only with difficulty and a large
penalty in throughput.
- A language (FMP FORTRAN) in which one can write either non-
vectorized concurrent operations, or vector operations.
- Word size. The computational fluid dynamics and weather
community requires no more than 10 or 12 decimal digits of precision,
corresponding to 33 or 36 bits in the fraction part of the
floating point word, with some computation and most data requiring
no more than six or seven digits (24 bit fraction part) of precision.
This is not true of other "typical" users of very high-speed
numeric processing. Requirements for precision run from 8
bits, for picture processing, to over 30 decimal digits, for users
with large, sparse, ill-conditioned matrices, typically structures
and applications. A large number of scientific processor users
desire 14 to 16 decimal digits of precision.
- A language based on FORTRAN to accommodate the applications
programmers, who, in the computational fluid dynamics and weather
communities, have mostly been used to working in one dialect or
another of FORTRAN.
3.2.2 Projected Performance, Summary
For the aero flow codes, the FMP here described would run the 3D
implicit in 6 minutes and 16 seconds (100 time steps) at a
throughput rate of 1.01 Gflops/sec* during that time. The 3D
explicit runs in 8 minutes and 52 seconds (again, for a test case
with 100 time steps) at a throughput rate of 0.89 Gflops/sec
during that time. In both cases, the mesh had a million grid
points (100 x 50 x 200 in the case of the implicit, 100 x 100 x
100 for the explicit). Feasibility is therefore demonstrated.
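The run times and rates quoted above can be cross-checked with a few lines of arithmetic. The sketch below multiplies each sustained rate by its run time to get the implied total operation count, and compares the run time with the 10-minute goal stated in Section 3.1. The operation totals are derived here for illustration and are not figures given in the report; the function and variable names are likewise illustrative.

```python
GOAL_SECONDS = 10 * 60  # 10-minute goal from Section 3.1


def run_summary(rate_gflops, minutes, seconds):
    """Return (elapsed seconds, implied total floating point operations)."""
    elapsed = minutes * 60 + seconds
    total_flops = rate_gflops * 1.0e9 * elapsed
    return elapsed, total_flops


implicit_time, implicit_flops = run_summary(1.01, 6, 16)
explicit_time, explicit_flops = run_summary(0.89, 8, 52)

for name, t, f in (("3D implicit", implicit_time, implicit_flops),
                   ("3D explicit", explicit_time, explicit_flops)):
    print(f"{name}: {t} s, about {f:.3e} operations, "
          f"meets 10-minute goal: {t <= GOAL_SECONDS}")
```

Both benchmark runs finish inside the 10-minute window, which is the basis for the feasibility claim.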
Other metrics can be used to describe the "raw" throughput, of
which the above is the net:
2.22 Gflops/sec would be the maximum throughput rate given that
operations are alternately add and multiply.
1.74 Gflops/sec would be the maximum throughput rate for register-
to-register operations using the instruction mix derived from
analysis of the aero flow codes.
1.33 Gflops/sec would be the throughput rate seen in about half of
the sequences of code submitted to the simulator.
Of the above, the figure of 1.33 Gflops/sec represents a through-
put rate achieved by a number of real sequences of code, taken
from both aero flow codes and from weather. It represents an
achievable throughput for "friendly" applications.
All of the above refers just to the FMP. The throughput of the
NASF is just as much dependent on proper function of the Support
Processor System (SPS) as it is on the FMP. The SPS, however,
presents well-known problems, not unique problems for the partic-
ular set of applications.
3.3 PERFORMANCE PROJECTION BASED ON BENCHMARK PROGRAMS
3.3.1 Summary
The four programs used as benchmarks in evaluating the design
were:
- NASA 3D implicit aero flow code supplied by Ames
- NASA 3D explicit aero flow code supplied by Ames
- GISS weather code, in several different versions
- Spectral weather code from MIT
Evaluations of the first three were comprehensive, resulting in
projections of 1.01 Gflops/sec and 6 minutes, 16 seconds for the
implicit; 0.89 Gflops/sec and 8 minutes, 52 seconds for the
explicit at the size of the benchmark; and 0.53 Gflops/sec and 4
minutes, 25 seconds for the GISS weather code. Appendix A
discusses these evaluations in detail. These evaluations and the
conditions leading to these conclusions are summarized in this
chapter.
*Gflops/sec: billion floating point operations per second.
The implicit code achieves the 1.0 Gflops/sec being used as a
guide for evaluating adequate throughput rate. The explicit code
nearly does. Since the intent of the explicit is to be computa-
tionally more efficient than the implicit, the performance goals
are deemed demonstrated.
On GISS weather, the non-vectorizable portions of the code exe-
cuted at more than one Gflops/sec (subroutine COMP3), while the
throughput rate observed in vectorizable portions (COMP1 and
COMP2) was reduced by EM accessing and memory-to-memory moves that
produced no floating point operations.
Examination of the spectral weather shows that the fluid dynamics
portion should run with higher flop rate than the fluid dynamics
portion of the GISS weather (COMP1 and COMP2), and that the chemistry
and physics portions were essentially identical to COMP3.
Hence, the spectral weather is expected to run at a higher flop
rate than the GISS weather.
3.3.2 Method
The method used for performance evaluation was generally the same
for all of the first three benchmark programs. Because of time
and budget limitations, only a cursory look was taken at the
Spectral weather code.
Throughput was analyzed on the basis of FMP computations. I/O
operations were ignored. Transfers between DBM and file system
are independent of, and go in parallel with, the FMP computation.
It is assumed that the file manager stages the next job, and
unloads the last job, in times which are completely overlapped
with current computation. DBM-EM transfers are also ignored,
since they go on concurrently with current processing as long as
EM space is available and take negligible time. The 15 million
words of a restart point of a typical aero flow code are loaded in
0.375 seconds, which can be compared with the 600 seconds duration
of a typical run. Therefore, both system I/O and user I/O were
ignored.
Each program was analyzed to find the calling tree of the sub-
routines, and subroutines were divided into sequences of code that
were internally similar. Analysis was performed on each indivi-
dual sequence and the results combined, taking into account proces-
sor utilization percentages and number of exceptions, into total
figures for each of the benchmark programs. Thus the analysis
included every line of code in the first three programs.
Analysis consisted of hand compilation and simulation for a select-
ed number of code sequences, and estimates based on interpolation
between known simulations for the rest of the sequences. In the
implicit aero flow code, over 60% of the computations of the
program are within the inner loop which was simulated. For those
sequences which were not simulated, a formula was developed which
interpolated between the simulated sequences. For a more detailed
description of this method, see Appendix A. Various cases in
which exception should be taken to the formula were also taken
into account. It was found that almost all of the simulation
results could be empirically fit by a formula of the form
T = (k_1 F + k_2 M + k_3 D) / p          (3.1)
where T is the time required to execute a particular code segment,
F is the number of floating point operations in that segment, M is
the number of EM accesses, D is the number of divisions over
and above the 2% divides assumed by the "standard" instruction
mix, and p is the number of processors processing. The constants
k_1 and k_2 are determined empirically from the simulation results,
and k_3 was set equal to the time of a divide instruction.
Throughput for the individual code segment is given by F/T.
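Equation 3.1 is simple enough to evaluate directly. The sketch below implements it as written. The numerical values used for the three constants are placeholders chosen only to make the example run; the report does not give them in this section, so the printed numbers carry no meaning beyond illustrating how the formula behaves.

```python
def segment_time(F, M, D, p, k1, k2, k3):
    """Equation 3.1: estimated execution time of one code segment.

    F  - floating point operations in the segment
    M  - EM accesses in the segment
    D  - divisions beyond the 2% assumed by the "standard" instruction mix
    p  - number of processors processing
    k1, k2, k3 - per-operation time constants (seconds)
    """
    return (k1 * F + k2 * M + k3 * D) / p


def segment_throughput(F, M, D, p, k1, k2, k3):
    """Throughput of the segment, F/T, in floating point operations per second."""
    return F / segment_time(F, M, D, p, k1, k2, k3)


# Placeholder constants, NOT taken from the report.
K1, K2, K3 = 4.0e-7, 1.0e-6, 2.0e-6

light_em = segment_throughput(F=1.0e6, M=1.0e5, D=0, p=512, k1=K1, k2=K2, k3=K3)
heavy_em = segment_throughput(F=1.0e6, M=5.0e5, D=0, p=512, k1=K1, k2=K2, k3=K3)
print(f"few EM accesses:  {light_em:.3e} flops/sec")
print(f"many EM accesses: {heavy_em:.3e} flops/sec")
```

The form of the equation makes the qualitative behavior reported for the benchmarks visible: for a fixed operation count, additional EM accesses or extra divides lengthen T and so lower F/T.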
The hand compilation made certain assumptions about the compiler.
Assignment of instances of the DOALL to processors was not optimized,
but was done by the simplest algorithm conceivable (see Chapter
4 for software discussions). Optimization steps such as the
substitution of an add or subtract from exponent to replace a
multiplication or division by a power of 2 were assumed.
3.3.3 Throughput of Aero Flow Codes
The implicit aero flow code, for which simulation covered over 60%
of the computations, was estimated to have a throughput rate of
1.01 Gflops/sec at the 100 x 50 x 200 size and ran 100 time steps in 6
min, 16 sec. The explicit code showed an estimated throughput
rate of 0.89 Gflops/sec at the 100 x 100 x 100 mesh size and ran
100 time steps in 8 min, 52 sec. Details are in Appendix A.
The language being considered, FMP FORTRAN (described in more
detail in Chapter 4), was found to fit the aero flow codes very
conveniently. A simple, one-to-one translation from FORTRAN codes
provided into FMP FORTRAN goes as follows. All arrays subscripted
with the grid variables are made GLOBAL. DO loops (single or
nested) on the grid variables are automatically turned into DOALLs
as long as the data dependence allows it. Temporary variables are
allowed to be LOCAL by default. The implicit code, as supplied by
NASA, is of such regularity that practically all of it can be
transformed into FMP FORTRAN using such simple rules. Because of
this, and in order to save time, most of the FMP FORTRAN versions
of the aero codes were not even written down, since they are
obvious by inspection from the FORTRAN versions provided.
During the hand compilation process it was found that translation
from FMP FORTRAN to FMP machine code was simple and straightforward.
This gives confidence that the compiler will be relatively
simple to write.
Processor utilization ranged from 93% (in the explicit) to an
average of 97.4% (in the implicit). Some routines gave 99.9% pro-
cessor utilization. All subroutines and other code sequences were
included in the total time and total number of floating point
operations. In neither aero flow code did any of those sequences
with low processor utilization have any influence on the final
throughput estimate.
3.3.4 Weather and Climate Codes
Two benchmark programs were supplied by NASA Ames for use in
evaluating the performance of the FMP for weather and climate
codes. The first, a Goddard Institute of Space Studies version of
the Mintz-Arakawa global circulation model, came in several differ-
ent versions written for several different machines. These
various versions are seen to have variations in portions of the
algorithm. The version analyzed was one written for the 360/195.
This is the same version that had previously been used as a test
case for analyzing BSP throughput.
The second is a "spectral" weather code from MIT, in which an FFT
is used to regularize the hydrodynamical computations.
The GISS code was analyzed at an intermediate grid size (2° latitude
steps, 2.5° longitude increments along the equator with 20
minute time steps). The program consists of an easily vectorizable
fluid dynamics section (subroutines COMP1 and COMP2 and the
subroutines they in turn call), and COMP3 and its callees, the
physics and chemistry section. The average throughput rate for
the entire program was determined to be 0.532 Gflops/sec with a
14-day simulation taking 4 minutes, 25 seconds of FMP time.
The GISS climate code demonstrated the advantages of the FMP
architecture. The vectorizable portions tended to run slow
because of many EM accesses, but COMP3 and its subroutines ran as
independent scalar processes in parallel in all the processors,
achieving over 1.2 Gflops/sec for the portion simulated. COMP3
and its subroutines have been shown to be hard to vectorize for
existing vector machines, whereas it is not necessary to vectorize
them for the FMP.
In this benchmark, substantial use is made of parts of the lang-
uage that see little or no use in the two aero flow codes,
including:
- Domain definitions using domain expressions that
include previously defined domains
- INALL arrays
- Initialization of values in declaration by arithmetic
expressions
- NEXTDO
- Branching within DOALLs
This is a worst-case analysis, in that any data-dependent branches
were assumed to demand the most computations. This approach was
used in order to estimate the worst-case maximum running time of
the GISS climate code. Other conditions which simplify the radiation
calculations (such as the existence of cloud cover) will
result in fewer floating point operations, and shorter times.
Whether the Gflops/sec rate would go up or down under these conditions
depends on whether flops or elapsed time is reduced proportionately
more. This case was not analyzed.
The spectral weather code is expected to run with substantially
higher throughput than the GISS climate code does. Its fluid
dynamics portions are done by spectral analysis, with each
processor processing an FFT independently of all other processors.
Thus, the fluid dynamics computations are much more locally
contained, since all the intermediate results in the FFT can be
contained within processor memory (declared either INALL or LOCAL)
and should run faster. The chemistry and physics portions of the
spectral weather code are substantially identical to the chemistry
and physics portions of the GISS climate code, and the analysis of
one can represent the analysis of the other.
3.3.5 Applications Beyond the Benchmarks
The analysis summarized earlier in this chapter and in Appendix A
demonstrates the applicability of the FMP described in Chapter 5
to aero flow and weather codes. This analysis is therefore a
constructive demonstration of the feasibility of the NASF. The
FMP as described has broader applicability than to applications
similar to the four benchmarks, as the remarks in this section
will indicate. The following are considered:
- Single FFT (In the spectral weather code the 512 processors
do 512 FFT's concurrently)
- Sort
- Problems too large to fit in Extended Memory (in "core")
In Section A.6 of Appendix A, the FFT is discussed. That section
shows that the FMP runs various sizes of FFT at throughputs
varying from 0.5 through 0.7 Gflops/sec. The reduction from 1.3
Gflops/sec is due to data rearrangement.
Section A.6 also discusses a method for achieving concurrency in
sorting keys or data elements that are contained in processor
memory. The particular method shown runs at 100% processor
utilization when sorting an array of elements that start out being
sorted in inverse order, such a case being a kind of worst-case
test for some sorting algorithms.
3.3.5.1 Large Problems
The "standard" scenario for the use of the FMP is that all files
necessary for the use of an FMP task are in place within DBM at
the time the task is started. During the course of a run, the
task is essentially self-contained within the "main memory",
namely EM, CM and PM. This does not preclude reading from DBM to
EM, or writing to DBM from EM at appropriate times. A set of
files are located in DBM at the start of the run. Files may be
created within DBM during the course of the run, snapshot dumps
for instance, and when such files are closed by the FMP program,
the file system has the option of moving them out of DBM before
the run is finished. The concept of having the high-speed computa-
tions contained within a bounded portion of the hardware, here the
FMP, with no interaction with external devices such as the support
processor, has been given the name "computational envelope".
However, the computational envelope is not completely sealed even
during the "standard" scenario.
Another scenario is the running of tasks that will not fit in main
memory. The following questions are considered. First, what
facilities should be available with the initially delivered com-
piler; second, what facilities are envisioned for possible later
implementation; and third, what problem properties allow efficient
operation for problems that do not fit in memory.
The system evaluated during this study does not have the system
software required for automatic virtual memory management for
taking care of overflow from EM. Hence, the user programmer will
have to insert READ and WRITE statements for access to DBM files.
As with any other direct I/O scheme, it behooves the programmer to
initiate I/O ahead of time, and test for completion at the point
of using it. Only one direct I/O operation can be going on at one
time. If a second direct I/O is called for before the first has
finished, the program would wait for the first I/O operation to
finish before initiating the second. User processing would be
suspended until that second direct I/O is started.
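The advice to start I/O early and test for completion only at the point of use is the familiar prefetch (double-buffering) pattern. The sketch below is a Python analogy, not FMP FORTRAN: read_block stands in for a READ of a DBM file, and a single-worker thread pool is used to model the stated rule that only one direct I/O operation can be in progress at a time. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_block(store, index):
    """Stand-in for a READ from a DBM file (hypothetical helper)."""
    return list(store[index])


def process_blocks(store, compute):
    """Compute on block i while the read of block i+1 is already in flight."""
    results = []
    if not store:
        return results
    # One worker models "only one direct I/O operation at one time".
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(read_block, store, 0)
        for i in range(len(store)):
            block = pending.result()  # test for completion at the point of use
            if i + 1 < len(store):
                # Initiate the next read before computing on the current block.
                pending = io.submit(read_block, store, i + 1)
            results.append(compute(block))
    return results


store = [[1, 2, 3], [4, 5], [6]]
totals = process_blocks(store, sum)
print(totals)
```

If the computation on each block takes at least as long as the read of the next one, the wait at pending.result() returns immediately and the I/O is fully hidden, which is the favorable case the text describes.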
The FMP hardware, which is described in more detail in Chapter 5,
is intended to be able to support a virtual memory mechanism
whereby certain Extended Memory (EM) files can be held in Data
Base Memory (DBM) when EM does not have enough space. EM space
would be dynamically allocated, and addressed with base registers,
hence one possible implementation of such virtual memory is to
have the base addresses of non-present data point outside of
actual memory space. The address-out-of-bounds interrupt would
trigger transfers between DBM and EM, plus some processing to fix
up base addresses.
The system considered during this study does not have the system
software to implement such a virtual memory scheme. This lack of
automatic virtual memory management did not impact the throughput
studies of the aero codes and GISS codes since these benchmarks
will be able to reside within the planned EM space.
Hardware mechanisms that allow virtual memory for Processor Memory
(PM) space using EM memory space to back up the PM space should
also be planned. Methods for supplying this feature are still
under discussion. One suggestion is that EM module No. P could be
assigned to processor No. P, giving each processor its own private
EM module for back-up for virtual memory purposes.
Some of the characteristics of an aero flow code that would exe-
cute satisfactorily, even though it would not fit in the Extended
Memory (EM) can be determined by analysis independent of the
method used to extend the storage capacity for problems into the
Data Base Memory (DBM).
The following discussion is based on the 3D implicit aero flow
code, whose major computational effort is in subroutine STEP and
its subroutine BTRI. A listing of BTRI in FMP FORTRAN is included
in Appendix H.
The bandwidth between EM and DBM (detailed in Chapter 5) is 50
million words per second. Since data overlay requires moving idle
data out to make room for the new data, half of this, or up to 25
million words per second, is the rate that files can be brought in
to be worked on. If the throughput rate (1.0 Gflops/sec) is to be
maintained, there must be 40 or more floating point operations for
every word brought into the EM from DBM. This goal can be met and
then some. As an example, consider the following demonstration of
one way of programming a 220 x 220 x 220 3D implicit aero flow
program.
The data is blocked into 64 blocks, each 55 x 55 x 55. The organ-
ization of these blocks is shown in Fig 3.1. At any given time,
four blocks forming a "pencil" in one direction will be in EM.
Computation sweeps from one end of the pencil to the other and
back again, so that having anything less than a pencil in EM will
increase the amount of overlays between EM and DBM dramatically.
Analysis of subroutines STEP and BTRI shows that there are about
84 floating point operations on each datum, larger than the 40
required for the desired throughput.
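The threshold of 40 operations per word follows directly from the figures given above, and can be checked as follows. This is a minimal sketch; the function name is illustrative, and the "sustainable rate" printed at the end is a derived figure, not one stated in the report.

```python
def required_flops_per_word(target_flops_per_sec, link_words_per_sec):
    """Operations needed per inbound word so that overlay traffic keeps pace.

    Half of the EM-DBM bandwidth is assumed to carry outgoing (idle) data,
    as stated in the text, leaving the other half for incoming data.
    """
    inbound_words_per_sec = link_words_per_sec / 2
    return target_flops_per_sec / inbound_words_per_sec


TARGET_RATE = 1.0e9       # 1.0 Gflops/sec
EM_DBM_BANDWIDTH = 50e6   # words per second between EM and DBM
FLOPS_PER_DATUM = 84      # from the analysis of STEP and BTRI

needed = required_flops_per_word(TARGET_RATE, EM_DBM_BANDWIDTH)
print(f"required: {needed:.0f} operations per word, available: {FLOPS_PER_DATUM}")

# Derived for illustration: the compute rate the inbound path could keep fed.
sustainable = FLOPS_PER_DATUM * EM_DBM_BANDWIDTH / 2
print(f"inbound data could sustain about {sustainable / 1e9:.1f} Gflops/sec")
```

Because 84 exceeds 40 by roughly a factor of two, the transfers can be hidden behind computation with margin to spare.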
Figure 3.1 A Pencil of Four Blocks Taken from a Grid that Has Been
Subdivided into Sixty-four Blocks
Memory requirements are dominated by the pencils. It is conven-
tional in such overlaying situations to have one pencil in "core"
being computed on, one pencil's worth of space allocated to the
next pencil being brought in, and one pencil's worth of space
allocated to the newly created data that is being written out.
Since only one transfer is going on at a given time, in the pre-
sent instance it may be possible to use the space vacated by newly
created data to contain the next pencil, so that only two pencils'
worth of space are needed. Assuming 15 variables per point, and
665,500 mesh points per pencil, three pencils (the worst case)
would occupy 29,947,500 words in EM; two pencils (the more likely
case) would occupy 19,965,000 words in EM.
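
The storage figures follow directly from the blocking. As a rough sketch in Python (the names are illustrative, not taken from the study), the arithmetic is:

```python
# Storage needed for pencils of the 220 x 220 x 220 example.
GRID_EDGE = 220
BLOCK_EDGE = 55
VARIABLES_PER_POINT = 15

blocks_per_edge = GRID_EDGE // BLOCK_EDGE      # 4 blocks along each axis
total_blocks = blocks_per_edge ** 3            # 64 blocks in the grid
pencils_per_sweep = blocks_per_edge ** 2       # 16 pencils per direction

points_per_block = BLOCK_EDGE ** 3
points_per_pencil = blocks_per_edge * points_per_block
words_per_pencil = points_per_pencil * VARIABLES_PER_POINT

worst_case_words = 3 * words_per_pencil        # compute + incoming + outgoing
likely_case_words = 2 * words_per_pencil       # outgoing space reused for incoming
```

The same blocking gives the count of 16 pencils per sweep direction used in the discussion of sweeps.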
Although the EM-DBM data transfers are completely hidden behind
computation, and do not slow down the throughput, there will be a
throughput reduction from the 1.01 Gflops/sec analyzed in Appendix
A from another cause. Not all the arrays declared LOCAL in the 3D
implicit code of the analysis will fit in processor memory. Some of
these arrays will have to be held in EM, where the access time is
longer. Alternatively, recomputation can be used to avoid the
saving of precomputed results.
After sweeping 16 such pencils in one direction, direction is
switched and 16 pencils are swept in the second direction, and
then in the third.
Virtual memory machines have been on the market for 19 years at
least; the Burroughs B5000 is an early example. All of the
commercially available virtual memory mechanisms show varying degrees of
throughput reduction when the data base for the problem is larger
than the main memory of the machine. When the programmer controls
his own direct I/O, there is the opportunity for favorable cases,
such as the implicit aero flow above, to achieve full machine
throughput on problems too large to fit in main memory.
3.3.6 Application Domain
The primary area of application of the FMP, according to the state-
ment of work, will be the aero flow codes. A secondary area of
application will be the weather and climate codes. Analysis of
the benchmark programs shows that for reasonable grid sizes, the
desired throughput is achieved. The range of problem sizes for
which the throughput applies is analyzed here, as well as what is
the largest problem that will fit in the DBM using the approach
described in the previous Section 3.3.5.
In the aero flow code, the smallest "good" grid size is that which
permits two dimensional DOALLs to run with reasonable efficiency.
Hence, the smallest grid has a single dimension of not much
smaller than the square root of 512. A grid of 22 x 22 x 22 is the
smallest that runs with 94% processor utilization or better. The
largest aero flow code is the largest one whose data base will fit in EM.
Assuming fifteen variables per grid point, and 2^25 words in the EM
address space, that is a grid of about 2.2 x 10^6 grid points.
Other EM space requirements reduce the figure. The largest
program that will fit in EM and DBM using direct I/O has a complete
data base allocated to DBM. At the currently specified size of
DBM, namely 2^27 words, this is an upper limit of about 9 x 10^6
grid points. If a larger upper limit were required, the size of
DBM could easily be increased.
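
The size limits can be reproduced with a short calculation. The sketch below is in Python and rests on two assumptions: the EM and DBM sizes are read as 2^25 and 2^27 words, and the processor count is taken as 512, the figure used elsewhere in this section for DOALL widths. The function name is illustrative.

```python
# Problem-size limits for the aero flow code, under the stated assumptions.
EM_WORDS = 2 ** 25
DBM_WORDS = 2 ** 27
VARIABLES_PER_POINT = 15
PROCESSORS = 512          # assumed processor count

em_grid_points = EM_WORDS // VARIABLES_PER_POINT
dbm_grid_points = DBM_WORDS // VARIABLES_PER_POINT


def doall_utilization(n, processors=PROCESSORS):
    """Fraction of processor slots doing useful work for an n x n DOALL."""
    instances = n * n
    cycles = -(-instances // processors)   # ceiling division
    return instances / (cycles * processors)
```

Under these assumptions a 22 x 22 plane keeps about 94.5% of the processors busy in a single cycle, while a 21 x 21 plane drops to about 86%, which is consistent with 22 being quoted as the smallest "good" edge length.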
In the weather and climate codes, the grid has a much smaller
dimension in height than it does in latitude or longitude. Not
all weather code grids are in terms of longitude and latitude;
other two-dimensional grids can be mapped onto the surface of the
earth as well. In the GISS climate code, as translated for the
FMP, almost all of the DOALLs are on a single layer. Thus, as
long as that layer is nearly 512 elements, the bulk of the
computations will be done with good processor utilization. The smallest
"good" weather problem would be one with 15° longitude spacing
along the equator, a grid of 20 x 24 in each layer.
Subroutine AVRX of the GISS code represents a non-negligible
portion of the computation. Appendix A describes five different
ways that AVRX may be mapped onto the FMP. Any one of the five
ways will work, but all have some drawback. The throughput of
AVRX will be poorer at the smaller grid sizes, and the preferred
implementation may vary as a function of grid size. Hence, the
throughput estimates of Appendix A (0.53 Gflops/sec), will have to
be revised somewhat for different grid sizes to take into account
the effects of AVRX*. At the grid size of 89 x 144 analyzed in
Appendix A, AVRX was 2.2% of the running time, executing at 0.065
Gflops/sec.
The largest weather code that will fit in memory, assuming 16
variables per grid point, would be a grid size of about 432 x 268
x 15 levels, or roughly three times the resolution of the case
explored. Alternatively, a grid size of 512 x 320 x 12 would also
fit. As with the aero flow codes, a grid with four times as many
points will fit into DBM, say 864 x 536 x 15 or 1024 x 640 x 12.
Running time on these latter codes would be quite long. Doubling
the resolution roughly raises the running time by a factor of 8,
assuming the same number of levels, since the spatial resolution
and the time resolution are roughly proportionate. Hence, with a
grid size of 432 x 268 x 9, which is triple the resolution of the
analyzed case, 27 times the running time of the 89 x 144 x 7 grid
size analyzed is expected. At this triple resolution, a fourteen
day run, with a 7 minute time step, would take roughly two hours,
based on multiplying 4 minutes, 25 seconds by twenty seven.
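
The scaling rule used above (two spatial dimensions plus the time step all refine together, with the number of levels held fixed) can be written out as a small Python sketch. The function name is illustrative; the base run time is the figure quoted in the text.

```python
# Running-time scaling for the weather code at a fixed number of levels.
BASE_RUN_SECONDS = 4 * 60 + 25     # 89 x 144 case: 4 minutes, 25 seconds


def run_time_scale(resolution_factor):
    """Latitude, longitude and time step each scale with the resolution."""
    return resolution_factor ** 3


triple_resolution_seconds = BASE_RUN_SECONDS * run_time_scale(3)
triple_resolution_hours = triple_resolution_seconds / 3600
```

Doubling the resolution gives the factor of 8 quoted in the text, and tripling it gives 27, or just under two hours for the fourteen-day run.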
*A rough estimate of 0.36 Gflops/sec for the 20 x 24 size is
arrived at by the following approximations. AVRX throughput
(0.065 Gflops per second) and running times are assumed the same
for the small grid as for the analyzed case. For the rest of the
code, throughput is assumed the same, but the running time was
reduced by a factor of 26, since DOALLs drop from 26 cycles to one
cycle. The result is one twenty-sixth as much useful computation
done in 6.08% of the time.
CHAPTER 4
SOFTWARE
4.1 INTRODUCTION
The primary uses of the NASF are expected to be design and model-
ing applications. These applications can be approached either by
experimentation (such as with wind tunnels) or by simulation.
Figure 4.1 shows the relationship of these two approaches. The
NASF is expected to support the abstraction of the "Real World"
with some mathematical system. Mathematical conclusions will be
established as a result of the simulation and these conclusions
will then be interpreted to determine the desired physical
conditions.
The abstraction process represents the development of algorithms
to model real-world situations. The NASF should provide tools and
support to assist in this abstraction process. The system
considered in this Feasibility Study would provide support for the
abstraction process both with simple extensions to the well-known
FORTRAN language and with an interactive system which can be used
to observe the results of the use of the model.
[Diagram labels: ABSTRACTION, MATHEMATICAL SYSTEM, SIMULATION,
MATHEMATICAL CONCLUSIONS, INTERPRETATIONS, PHYSICAL CONDITIONS, EXPERIMENT]
Figure 4.1 Relationship of Simulation and Experimentation
The simulation process would also be supported with the language
extensions. The Support Processor and the File System would be
used with the FMP during the simulation process to provide the
same careful controls and monitoring needed during an experi-
mentation process. The results of simulations would be observed
through use of the various NASF user facilities (printers, gra-
phics terminals, COM, etc.) for interpretation by the users.
Where the results of experiments might be available on the
facility, comparisons between simulations and experiments would be
made.
These processes (abstraction, simulation, and interpretation)
require use of most of the components planned for the NASF. The
system-level components were already described in Chapter 2.
Chapter 5, which follows, discusses the Flow Model Processor (FMP)
in detail. This chapter concentrates on the system-level software
required to support these processes. The most direct software
support of users comes from some means of describing the mathema-
tical system which is the result of the abstraction process and of
controlling the simulation process. In the NASF, the language
used to define processes on the FMP provides the support required.
Other forms of software support are the Master Control Program
(the operating system which controls all parts of the NASF), the
File System Control Software, Intrinsics, and Test and Diagnostic
Support Software.
4.2 FMP FORTRAN
The language considered as a means for supporting the design and
modelling applications on the NASF is a dialect of FORTRAN. This
dialect is based on ANSI Standard X3.9-1978 [10] and includes a
few extensions which are appropriate both to implementation of the
models and their simulation.
The description of the FMP FORTRAN presented here is substantially
the same as that actually used during the application analysis
(see Chapter 3). The language constructs presented are particularly
oriented to describing a set (or collection of sets) of discrete
processes which may be used to define the desired models.
This simple set of constructs seems sufficient to support the
applications planned for the NASF.
4.2.1 Language Design Considerations
The design of a language must be concerned not only with the
utility of its use for applications, but it must also consider
problems of complexity, of implementation on the hardware of
interest, and of debugging and verification capabilities.
4.2.1.1 Complexity
As the capabilities of available hardware expand, the uses of the
hardware have expanded to the point where the software has become
extremely complex. The development of a new language to support
the NASF community therefore must consider the problems of
complexity. Since programs are more often read than written, the
language source becomes important in two forms of communication,
one with other users and the other with the NASF. The most
important concern then is to try to achieve a match between the
apparent complexity in a program and our human ability to deal with
that complexity. The language constructs described in the follow-
ing sections have been chosen to highlight the essential major
ideas of a model while using "standard" FORTRAN to define the
details. Some of the complexity of the "standard" FORTRAN will be
removed by optional automatic formatting of the source listings.
For example, the section of SMOOTH shown in Figure 4.2 shows how
indentation can be used to clarify the scope of the various
control structures such as DO and IF.
Although the design of a language cannot provide the desired
simplicity automatically, the constructs can be chosen to
influence programming style in the desired way. Therefore, the
constructs chosen in the language extension are few and general in
nature. The programs written using these constructs should be
easily understood, hopefully even more understandable than the
same algorithm expressed in serial constructs. This result is
expected because each part of a program can be kept conceptually
simple and because the relations between the parts of the program
are kept simple. The constructs chosen also make the
representation of discrete models more natural. These constructs
should allow simplification of the abstraction process without
losing the ability to make efficient use of machine organization.
In addition, subsequent modifications to the abstractions should
be simpler.
4.2.1.2 Abstraction and Modeling
Before considering the proposed language constructs, the abstrac-
tion and simulation processes of Figure 4.1 should be discussed in
more detail. The problems faced in the practical use of the NASF
will be how to abstract the real-world systems of interest and how
to control the simulation of such systems so that the results
would be a meaningful adjunct to experimental results.
Since a digital system cannot directly model a continuous process,
the abstractions must be to some discrete-system representation.
The first step in such an abstraction is to identify the structure
(and substructure) of the model. For example, the geometries or
grids of interest would be defined. Then the information of
interest throughout the model would be identified. Such informa-
tion is usually called "state" information since it describes the
current state (or value) at the point of the model with which it
is associated. For example, when studying air flow around an
object, wind velocity, wind direction, and pressure may be of
interest at each point of the model. At the same time, some state
information may apply to the entire model (such state information
is invariant over the model). For example, the Reynolds number
of the fluid flowing around an object might be assumed to be the
same everywhere. If such an assumption is made, one value can be
used to represent the entire model.
Figure 4.2 Example of Source Code Formatting
[Listing only partly legible. Its recoverable structure is an
IF ... THEN block containing a DO loop that ends at a CONTINUE, an
ELSE branch containing a second DO loop and CONTINUE, and a closing
ENDIF, with the body of each control structure indented.]
Real systems of interest are usually not static. They show some
"behavior" over time. Such behavior is observed through the state
information. Thus some means must be provided to describe the
process by which the state at a particular point in the model
changes over time. Conceptually, such a process exists at each
point in the model. Note that these processes are concurrent.
The language constructs chosen below provide means of describing
both the spatial relationships (geometry and state) and the
temporal relationships (processes) in a model. In general, stan-
dard FORTRAN constructs are used to define the process of state
change at each point in the model while a new construct (called
DOALL) is used to identify the natural points of concurrency.
Although the normal FORTRAN variable and array mechanisms are
available to describe the state of the model, two additional con-
structs are defined which are intended to make the abstraction of
models more straightforward as far as geometry and state
variables are concerned and which assist in efficient usage of the
storage of the FMP.
4.2.2 Language Constructs
The language called FMP FORTRAN is based on ANSI FORTRAN 77
(X3.9-1978) [10] with extensions and modifications to improve its
utility for use for the planned applications and to allow effi-
cient use of the projected hardware. FMP FORTRAN is expected to
implement all of the features of ANSI FORTRAN 77 except that
CHARACTER type, all usage of CHARACTER type, and Input/Output
Statements are as defined in the subset FORTRAN in the ANSI
document [10].
The additional constructs described in the following sections have
been motivated by the abstraction and simulation functions already
described. The three major areas discussed are geometry, state of
the model, and process modeling. As with other parts of the
system approach discussed in this report, areas for continued
improvement certainly exist. However, the language constructs
reported here were sufficient as far as the specific application
programs considered are concerned.
4.2.2.1 Introductory Example
To introduce the basic concepts of the language extensions, a
simple example will be considered first. The sections which
follow will explore each of the major areas in more depth.
Figure 4.3 shows the main computation section of TURBDA (from the
explicit aerodynamic flow code). Note that there are three nested
loops. Also note that the computation inside the loops is
independent of the nesting order. The variable CV1 is used each
time the inner loop is evaluated without change.
Now consider Figure 4.4 which is a corresponding FMP FORTRAN
version of the code in Figure 4.3. Note that the nested DO loops
are replaced with a statement called "DOALL" at the start of the
loop and with a statement called "ENDDO" at the end of the loops.
This version would execute exactly the same as the original
version if only one processor is available. However, the DOALL
construct gives the specific information that the computation of
the inner loop for each combination of I, J, and K values is
independent of all other I, J, K combinations. In other words, if
enough processors were available, all IL*JL*KL instances of the
code in the inner loop could be computed concurrently. In this
case, there would be IL*JL*KL copies of the inner loop code (one
copy per processor). Execution of the DOALL statement would acti-
vate all IL*JL*KL instances simultaneously, (one per processor),
each with its own set of I, J, K values. After all instances had
completed, execution would continue after the ENDDO statement,
just as control passes the CONTINUE statement in the original code
when all loops are complete.
From an applications standpoint, the DOALL statement identifies a
grid over I, J, and K. The arrays EI and RMUL have one element
(of "state" information) corresponding to each point of the grid.
The variable CV1 is a "global" state variable. The code between
the DOALL and ENDDO statements describes the process of changing
state variables from the old set of values to a new set of values.
Note that the process is logically different along the J=1 and K=1
planes. The evaluation of the code for a specific combination of
I,J,K values is called an instance.
The compiler is informed of the usage of variables in this case
with the last part of the DOALL and ENDDO statements. The vari-
ables and arrays listed after the word USING in the DOALL state-
ment and after the word GIVING in the ENDDO statement identify the
state variables (usually in Extended Memory).
Before considering the details of each of the constructs, consider
how each of the instances execute and use memory. In the case of
IL*JL*KL processors, all processors begin execution, each on its
instance. Each processor has a copy of the code and executes out
of its own local storage. CV1 and the array EI(I,J,K) would be
referenced in the common Extended Memory. The variable TEMP is
completely local to an instance. Therefore, each processor would
have a storage location for TEMP as used in the instance executing
on that processor. The resulting array RMUL(I,J,K) would also
be in Extended Memory. Since all IL*JL*KL instances need the
value of CV1, space would be allocated in each processor for CV1
and the value would be broadcast to all processors. This approach
costs a little storage and would save IL*JL*KL - 1 references to
the slower Extended Memory.
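
The key property of a DOALL, that instances may run in any order or all at once without changing the result, can be illustrated with a small sequential model. The following Python sketch is not FMP FORTRAN and is not the study's implementation. The loop body is a stand-in modeled on Figure 4.3, and its constants are read from a partly legible listing, so they should be treated as assumptions. The names run_doall and rmul_instance are hypothetical.

```python
import itertools
import math
import random


def rmul_instance(ei, cv1, i, j, k):
    """One instance: reads shared EI and CV1, keeps TEMP private."""
    temp = abs(ei[(i, j, k)]) * cv1
    if k == 1:
        temp = 0.5 * abs(ei[(i, j, 1)] + ei[(i, j, 2)]) * cv1
    if j == 1:
        temp = 0.5 * abs(ei[(i, 1, k)] + ei[(i, 2, k)]) * cv1
    # Constants assumed from the stand-in loop body.
    return 2.270e-8 * math.sqrt(temp ** 3) / (temp + 198.6)


def run_doall(ei, cv, il, jl, kl, reorder=None):
    """Evaluate every (I, J, K) instance; reorder may permute the schedule."""
    cv1 = 1.0 / cv   # global value, computed once and shared read-only
    schedule = list(itertools.product(range(1, il + 1),
                                      range(1, jl + 1),
                                      range(1, kl + 1)))
    if reorder is not None:
        reorder(schedule)
    return {(i, j, k): rmul_instance(ei, cv1, i, j, k) for (i, j, k) in schedule}


IL, JL, KL = 3, 3, 3
ei = {(i, j, k): 400.0 + i + 10 * j + 100 * k
      for i in range(1, IL + 1)
      for j in range(1, JL + 1)
      for k in range(1, KL + 1)}
natural = run_doall(ei, 4.0, IL, JL, KL)
shuffled = run_doall(ei, 4.0, IL, JL, KL, reorder=random.Random(0).shuffle)
```

Because each instance writes only its own element of the result and reads only values that no instance modifies, the natural and shuffled schedules produce identical results. That is the condition a programmer asserts by writing DOALL instead of nested DO loops.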
      CV1=1./CV
      DO 1 K=1,KL
      DO 1 J=1,JL
      DO 1 I=1,IL
      TEMP=ABS(EI(I,J,K))*CV1
      IF(K.EQ.1) TEMP=.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
      IF(J.EQ.1) TEMP=.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
      RMUL(I,J,K)=2.270E-08*SQRT(TEMP**3)/(TEMP+198.6)
    1 CONTINUE
Figure 4.3 Section of TURBDA
      CV1 = 1.0/CV
      DOALL I=1,IL; J=1,JL; K=1,KL; USING /R1/,/R5/,CV1
         IF(J.NE.1 .AND. K.NE.1) TEMP = ABS(EI(I,J,K))*CV1
         IF(K.EQ.1) TEMP=0.5*ABS(EI(I,J,1)+EI(I,J,2))*CV1
         IF(J.EQ.1) TEMP=0.5*ABS(EI(I,1,K)+EI(I,2,K))*CV1
         RMUL(I,J,K) = 2.270E-08*SQRT(TEMP**3)/(TEMP+198.6)
      ENDDO; GIVING /R6/
Figure 4.4 FMP FORTRAN Version of Section of TURBDA
4.2.2.2 Geometry
When planning the discrete representation of a real-world system,
three major steps are usually involved. First the geometry or
general structure of the model is defined together with any sub-
structure expected to be used during definition of the structural
and temporal relationships of the model. In general, all of the
discrete points or elements of a model will be incorporated into a
set. Algorithms used to map this form of the model onto the
hardware will be described in Section 4.2.2.2.7.
The examples of the proposed language constructs which follow are
based on physical models and corresponding cartesian coordinate
systems. These concepts apply as well to transform spaces.
The geometry of the model is defined by first describing a sub-
structure of single dimension. This substructure is then used to
describe structures of higher dimension. Since the points of the
discrete model are usually identified with an ordered set of
integers, the construct used, called DOMAIN, is capable of build-
ing ordered sets. For example,
      DOMAIN /X/ : L=1, MAXX
Here the name of the domain is "X". The domain is an ordered set
of values as defined with the implied-DO form. If MAXX=5, then
/X/ = {1,2,3,4,5}. Two such linear domains can be used to define
a two-dimensional domain. For example:
      DOMAIN /LAT/ : I=1, IMAX
      DOMAIN /LON/ : J=1, JMAX
      DOMAIN /LAYER/ : /LAT/ .X. /LON/
Here the domain LAYER was defined to be the Cartesian product of
the two linear domains LAT and LON. The result is that LAYER
consists of all (I,J) pairs.
Two forms for describing geometries of interest will be described
below. The first, called DOMAIN, allows the user to define the
overall structure or framework of the model. As in standard
FORTRAN, this form establishes the maximum structure of interest
in the problem, and is used in the mapping to hardware to properly
allocate storage and processors. The second form, called REGION,
is used to dynamically specify an arbitrary set of elements from a
domain.
4.2.2.2.1 DOMAIN Declarations. The domain declaration can take
either of two forms: direct specification or construction.
For example:
      DOMAIN /JK/ : J=1, 10; K=1, 15
Note that if more than one domain-variable-set is included, the
resulting domain is assumed to be the cartesian product of the
individual linear sets defined by each domain-variable-set. In
the example above, the domain JK consists of all (J, K) pairs
where J is from the set {1,2,3,...,10} and K from the set
{1,2,3,...,15}.
      DOMAIN /J/ : JJ = 1, 10
      DOMAIN /K/ : KK = 1, 15
      DOMAIN /JK/ : /J/ .X. /K/
The last domain, JK, is defined as the explicit cartesian product
(cross-product) of the sets defined in domains J and K.
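
That the two forms of declaration describe the same set of (J, K) pairs can be illustrated with a short Python sketch. This models the semantics only. The helper variable_set is a hypothetical name, and a positive step is assumed.

```python
from itertools import product


def variable_set(first, last, step=1):
    """Values generated by one domain-variable-set, as in a DO loop (step > 0)."""
    return range(first, last + 1, step)


# Direct specification: both variable sets given in one declaration.
jk_direct = set(product(variable_set(1, 10), variable_set(1, 15)))

# Construction: two linear domains combined with the .X. operator.
domain_j = variable_set(1, 10)
domain_k = variable_set(1, 15)
jk_constructed = {(jj, kk) for jj in domain_j for kk in domain_k}
```

Both forms yield the same 150 pairs. The names JJ and KK play no role after the declaration, matching the rule that a domain-variable is only a notational convenience.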
The syntax charts below use the same conventions as in the FORTRAN
77 ANSI standard document [10].
The syntax of the DOMAIN declaration statement is as follows:
domain_statement:
      DOMAIN / domain_name / : domain_spec_parameters
      DOMAIN / domain_name / : domain_construct_expression
   A domain-name follows the normal FORTRAN rules for
   variable naming, and may not be the same as the name of
   any common block.
domain_spec_parameters:
      domain_variable_set [ ; domain_variable_set ] ...
domain_variable_set:
      domain_variable = integer_expression , integer_expression
                        [ , integer_expression ]
A domain-variable is an integer variable.
The domain-variable is used only for notational conven-
ience during the definition of the domain. The domain-
variable does not remain "attached" to the domain during
execution. Other means for referencing elements of a
domain are described below (instance-identifier-lists and
instance-variables).
The integer expressions are evaluated at the point in
the program where the domain is declared. The domain-
variable-set establishes a sequence of values exactly as
for a DO-loop. If the last integer expression is omitted
it is taken equal to 1 by default.
domain_construct_expression:
      operand [ domain_construct_operator operand ] ...
   where each operand is one of
      / domain_name /
      domain_spec_parameters
      ( domain_construct_expression )
A domain-construct-operator is a set operator used to construct a
new set from two previously defined sets. The defined
domain-construct-operators are as follows:
   .U. (union). The resulting domain includes all the elements
   of both domains. All elements in the resulting domain are
   unique (duplicates are deleted). The dimensionality of the
   resulting domain will be that of the operands (which must
   match).
   .I. (intersection). The resulting domain includes only
   elements that are present in both domains. Dimensionality of the
   operands must match.
   .X. (product). Each element of the resulting domain
   corresponds to a pair of elements (one from each of the operands).
   The dimensionality of the resulting domain equals the sum of
   the dimensionality of the two operand domains.
   .N. (relative complement). The resulting domain is the same
   as that of the first (left-hand) operand with any elements
   which occur in the second (right-hand) operand removed. The
   dimensionality of the operands must match.
The precedence order for the domain-construct-operators is
.X., .N., .I., .U. Evaluation is from left to right.
Parenthesized expressions are allowed.
The size of a domain may be variable even if the domain is not
a dummy argument to a procedure. The domain is defined at run
time on entry to the procedure. Each procedure invocation may
cause a different-sized domain to be defined.
The variables defining the extents of the domain in the domain-
variable-set may be changed during execution of the procedure.
Such change does not have any effect on the size or shape of
the domain. Once the size and shape are determined on entry
they are fixed for the duration of the procedure.
Dimensionality is the number of domain-variables
needed to define the domain.
4.2.2.2.2 Examples. The following are some examples of legal
DOMAIN declarations together with the actual DOMAIN defined.
      DOMAIN /LONG/ : I = 1, 4
            {1, 2, 3, 4}
      DOMAIN /LAT/ : J = 1, 5
            {1, 2, 3, 4, 5}
      DOMAIN /ODDLAT/ : J = 1, 5, 2
            {1, 3, 5}
      DOMAIN /NORTH/ : J = 5, 5
            {5}
      DOMAIN /SOUTH/ : J = 1, 1
            {1}
      DOMAIN /MIDLAT/ : /LAT/ .N. /NORTH/ .N. /SOUTH/
            {2, 3, 4}
      DOMAIN /LAYER/ : /LAT/ .X. /LONG/
            {(1,1) (2,1) (3,1) (4,1) (5,1) (1,2) ... (4,4) (5,4)}
      DOMAIN /LEVEL/ : K = 1, 2
            {1, 2}
      DOMAIN /ATMOS/ : /LAT/ .X. /LONG/ .X. /LEVEL/
            {(1,1,1) (1,1,2) (1,2,1) (1,2,2) ... (5,4,2)}
An alternate form of the above is
      DOMAIN /ATMOS/ : /LAT/ .X. I=1,4 .X. /LEVEL/
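
The four domain-construct-operators can be modeled with ordinary list operations. The Python sketch below is a hedged illustration of their set semantics as described above, not the compiler's representation. All function names are hypothetical, elements are stored as tuples so that dimensionality is the tuple length, and the element order produced by product is illustrative only.

```python
def linear(first, last, step=1):
    """A one-dimensional domain: ordered values as in a DO loop (step > 0)."""
    return [(v,) for v in range(first, last + 1, step)]


def _same_dimensionality(a, b):
    if a and b and len(a[0]) != len(b[0]):
        raise ValueError("operand dimensionality must match")


def union(a, b):            # .U.
    _same_dimensionality(a, b)
    return list(a) + [e for e in b if e not in a]


def intersection(a, b):     # .I.
    _same_dimensionality(a, b)
    return [e for e in a if e in b]


def complement(a, b):       # .N. (elements of a that are not in b)
    _same_dimensionality(a, b)
    return [e for e in a if e not in b]


def product(a, b):          # .X. (dimensionalities add)
    return [ea + eb for ea in a for eb in b]


lat = linear(1, 5)
long_ = linear(1, 4)
level = linear(1, 2)
oddlat = linear(1, 5, 2)
north = linear(5, 5)
south = linear(1, 1)

# /LAT/ .N. /NORTH/ .N. /SOUTH/, evaluated left to right.
midlat = complement(complement(lat, north), south)
layer = product(lat, long_)
atmos = product(layer, level)
```

Applied to the declarations above, this gives MIDLAT = {2, 3, 4}, a LAYER of 20 pairs, and an ATMOS of 40 triples, in agreement with the listed results.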
4.2.2.2.3 Restrictions. The prototype compiler should have a
domain dimensionality restriction to four domain-variables. This
restriction would help limit the problem of mapping to hardware
resources to an acceptable level of complexity. This restriction
would likely be lifted with later releases of the compiler.
4.2.2.2.4 Scope. The scope of a domain-name and the corresponding
set of points is a program unit. The scope of a
domain-variable is the domain-declaration statement. When the
same domain must be used in several program units, it must be
declared within each of them (like a named common block).
4.2.2.2.5 Required Order. The position of a domain-declaration
statement within a program is the same as "other Specification
Statements" (see Figure 4.5).
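+---------+--------------------------------------------------------+
|         | PROGRAM, FUNCTION, SUBROUTINE, or BLOCK DATA Statement |
|         +--------+------------+----------------------------------+
|         |        |            | IMPLICIT Statements              |
|         |        | PARAMETER  +----------------------------------+
| Comment | FORMAT | Statements | Other Specification Statements   |
| Lines   | and    +------------+----------------------------------+
|         | ENTRY  |            | Statement Function Statements    |
|         |        | DATA       +----------------------------------+
|         |        | Statements | Executable Statements            |
|         +--------+------------+----------------------------------+
|         | END Statement                                          |
+---------+--------------------------------------------------------+
Figure 4.5 Required Order of Statements and Comment Lines
in a Program Unit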
4.2.2.2.6 Application and Usage. The DOMAIN specification will
be used to define all those discrete points of the model at which
state information and/or processing will exist. Each discrete
point of the structure or substructure of the model will be
represented by an element in some domain.
4.2.2.2.7 Mapping. The geometry as specified must be mapped onto
the available hardware as a mapping under which the actual
modeling and simulation will run. Many possible mappings exist
with many tradeoffs to consider.
A simple static mapping was proposed and used during the
application analysis. Such a static mapping is easiest to implement,
and results in the least compiler complexity. With such mapping
the least run-time overhead is devoted to mapping and concomitant
data rearrangements. With this mapping, the linear representation
of a domain is mapped to its corresponding processor number modulo
the maximum number of active processors. The linear representation
of a domain is the same as the storage order of an array with the
same subscript values and subscript variable order as the
domain-variables. For example, for an 8 processor system, domain LAYER
would be allocated as shown in Figure 4.6. The compiler code
should be sufficiently modular that other mappings can be easily
evaluated.
The static mapping was used during the application analysis
summarized in Chapter 3. The application analysis results show
that such a static mapping will support the applications studied.
Thus, the FMP is feasible to implement, even without possibly more
elegant mapping techniques.
      DOMAIN /LAYER/ : J=1,5 .X. I=1,4
PROCESSOR     0      1      2      3      4      5      6      7
            (1,1)  (2,1)  (3,1)  (4,1)  (5,1)  (1,2)  (2,2)  (3,2)
            (4,2)  (5,2)  (1,3)  (2,3)  (3,3)  (4,3)  (5,3)  (1,4)
            (2,4)  (3,4)  (4,4)  (5,4)
Figure 4.6 Modulo Mapping of Elements of a DOMAIN to Processors
Many other possible mappings exist. Another simple static mapping
is where the first "n" elements would be assigned to the first
processor, the next "n" elements to the next, and so on. Here "n"
is defined to be the next integer equal to or larger than the
total number of elements divided by the number of processors. In
the above example, domain elements (1,1), (2,1), (3,1) would be
assigned to processor 0; (4,1), (5,1), (1,2) would be assigned to
processor 1, etc.
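
Both static mappings described above are easy to state as code. The following Python sketch (function names are illustrative, not from the study) builds the linear representation of domain LAYER in array storage order, first subscript varying fastest, and assigns elements to 8 processors under the modulo rule and under the block rule.

```python
import math


def linear_order(first_max, second_max):
    """Elements in array storage order: the first subscript varies fastest."""
    return [(a, b)
            for b in range(1, second_max + 1)
            for a in range(1, first_max + 1)]


def modulo_mapping(elements, n_processors):
    """Element at linear position p goes to processor p mod n_processors."""
    table = {p: [] for p in range(n_processors)}
    for position, element in enumerate(elements):
        table[position % n_processors].append(element)
    return table


def block_mapping(elements, n_processors):
    """First n elements to processor 0, next n to processor 1, and so on."""
    n = math.ceil(len(elements) / n_processors)
    table = {p: [] for p in range(n_processors)}
    for position, element in enumerate(elements):
        table[position // n].append(element)
    return table


layer = linear_order(5, 4)          # DOMAIN /LAYER/ : J=1,5 .X. I=1,4
by_modulo = modulo_mapping(layer, 8)
by_block = block_mapping(layer, 8)
```

For this 20-element domain the block rule uses n = 3, so the eighth processor receives no elements, whereas the modulo rule gives every processor two or three. This illustrates one of the tradeoffs between the two schemes.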
Another consideration could be the locality of reference. In this
case elements could be mapped so that the processes associated
with all elements assigned to a processor tend to reference data
already physically in that processor, thus reducing traffic to the
extended memory. Dynamic allocation strategies could also be
considered. However, dynamic allocation must balance the benefits of
a possible more uniform use of all processors with the likelihood
of increased movement of data to and from the common memory. (In
a static mapping, variables which are referenced only by the
instances within a processor could be assigned storage space
within that processor.) Further study is needed to determine the
most cost-effective strategy.
4.2.2.2.8 REGION Statement. Facilities to identify a REGION of
active interest within specified DOMAINs are provided. These
REGIONs do not constitute a separate structure. Essentially, a
REGION is a virtual domain with dynamically selected elements of
the original DOMAIN. The elements may be sparse or dense,
rectangular or skewed sections of domains. The REGION declaration
may be used to explicitly create a virtual domain with
dimensionality greater than its original domain or to define which portion
of a DOMAIN is to be "processed". The specification of a REGION
may be dynamic. The general form of the REGION declaration is:
region_statement:
      REGION / region_name ( domain_construct_expression* ) /
            = / domain_name ( integer_expressions ) /
For example
      REGION /JKPART (J=1,5; K=1,9)/ = /JK(J+5, K+2)/
The values of J and K specified for the region named JKPART are
substituted in the expressions associated with domain JK to
determine the correspondences. Specifically:
      JKPART (1,1) is "equivalent" to JK (6,3)
      JKPART (2,1) is "equivalent" to JK (7,3)
      JKPART (2,9) is "equivalent" to JK (7,11)
domain_construct_expression* is the same as the construct defined
earlier except the / domain_name / part may not be used. The
domain variable names are not variables to which values are
assigned. Rather they are dummy names used to define the mapping
which identifies that part of the domain which is the region of
interest.
Each of the integer_expressions may be a linear combination of the
domain variables used as part of the REGION declaration. References
to intrinsic functions will be truncated to integer if
necessary.
If the REGION is to choose a one-to-one mapping to DOMAIN elements
in the same order as in the DOMAIN, the "*" identifier may be used
instead of an integer expression for that domain dimension.
4.2.2.2.9 REGION Declaration Examples. Assume that the following
DOMAIN declarations exist:
DOMAIN /LAT/ : I = 1, 20
DOMAIN /LON/ : J = 1, 30
DOMAIN /LAYER/ : /LAT/ .X. /LON/
The following regions correspond to the drawing in Figure 4.7.
REGION /LAYER1 (I=1,5; J=1,10)/ = /LAYER(I,J+10)/
REGION /LAYER2 (I=1,10; J=1,10)/ = /LAYER(I+5,J)/
REGION /LAYER3 (I=1,10; J=1,10)/ = /LAYER(MOD(I+15,IMAX),J+20)/
Note in LAYER3 that LAYER3 (1,1) = LAYER (16,1)
                    LAYER3 (5,1) = LAYER (20,1)
                    LAYER3 (6,1) = LAYER (1,1)
                    LAYER3 (10,1) = LAYER (5,1)
4.2.2.3 Model State
After defining the geometry or structure of a model, the state of
the model at each point of interest in the defined structures and
substructures needs to be described. The state can be described
through the use of the various variable types and declarations
available in FORTRAN. Since with the applications of interest,
the state variables are the same at all points throughout the
structure, a new construct, called INALL, was defined to help
simplify the description of the state throughout a structure. For
example, the declaration:
REAL INALL /LAYER/ WNDVEL, WNDDIR, T, P
[Figure 4.7 Example Regions Selected from Domain LAYER: regions
LAYER1, LAYER2, and the left and right halves of LAYER3 drawn on
the 20 by 30 grid.]
defines the variables called WNDVEL, WNDDIR, T, and P. Unlike
standard FORTRAN, this declaration specifies that each variable
occurs "INALL" of the discrete points defined by the domain called
LAYER. In other words, a different variable WNDVEL exists at each
point or element of the domain LAYER. (Recall that LAYER was
defined in the example above as
DOMAIN /LAYER/: I=1,20; J=1,30)
Variables defined with the INALL statement can be used in FORTRAN
the same way as dimensioned variables as described later. In such
a use, the subscripts identify the point (element) of interest in
the structure.
The result of the INALL statement is a set of wind velocity, wind
direction, temperature and pressure variables at each point of the
domain. The storage reserved would be the same amount as a
dimension statement of the form
DIMENSION WNDVEL(20,30), WNDDIR(20,30), T(20,30), P(20, 30)
However, unlike variables declared with standard dimension state-
ments, the "inall-variables" can be considered to be simple,
unsubscripted variables when defining the process to be simulated
at each point in the model, as described later. When the names in
the INALL declaration have dimensionality, the implied subscript
positions of the domain variables precede the subscript positions
whose dimensionality is explicit.
4.2.2.3.1 INALL Declarations. The INALL declaration would take
the form:
inall_statement:
    [ INTEGER | REAL | COMPLEX | DOUBLE PRECISION | LOGICAL ]
        INALL / domain_name / inall_variable {, inall_variable}
If a type is declared, it applies to each of the inall variables
listed. The inall variable can be a variable name, an array name
or array declarator. If no type is specified, each inall variable
on the list will be of implicit type or as specified in a separate
type statement.
The INALL declaration serves the dual function of declaring the
type of the variables on the list and of declaring storage require-
ments. The INALL declaration semantically indicates that each
element of the domain defined has associated with it variables
identified in the list. Thus, if there are 10 variables on the
list and if the domain declared has 3 dimensions with extents of
3, 4, and 5, then the storage space reserved would be 3*4*5*10 or
600 storage units.
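The storage arithmetic above can be checked with a short Python sketch (Python is used only for illustration). The function name inall_storage_units is hypothetical, and the sketch assumes one storage unit per element regardless of type, which the text does not state for types such as DOUBLE PRECISION or COMPLEX.

```python
from math import prod

def inall_storage_units(domain_extents, variable_sizes):
    """Storage units reserved by one INALL declaration.

    domain_extents -- extent of each domain dimension
    variable_sizes -- elements per listed name (1 for a scalar,
                      the array length for an array declarator)
    Assumes one storage unit per element (an assumption of this sketch).
    """
    return prod(domain_extents) * sum(variable_sizes)

# Ten scalar variables over a 3 x 4 x 5 domain.
ten_scalars = inall_storage_units([3, 4, 5], [1] * 10)
print(ten_scalars)

# REAL INALL /LAYER/ WNDVEL, WNDDIR, T, P with LAYER of 20 x 30 points.
layer_state = inall_storage_units([20, 30], [1, 1, 1, 1])
print(layer_state)
```

The second call matches the DIMENSION statement shown earlier: four arrays of 20 by 30 elements.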
4.2.2.3.2 Scope. The scope of the INALL-variable name is a
program unit. If an INALL variable is in a named common block,
all names in that named common block must match in the several
program units where they occur.
4.2.2.3.3 Application and Usage. Each element of a domain will
include the set of declared inall-variables. That is, a unique
set of inall-variables will exist for each point in the domain.
The language constructs used for referencing these variables are
described in Section 4.2.2.6.
4.2.2.3.4 Mapping. The physical storage allocated for the inall-
variable set corresponding to each point of the specified domain
would be allocated to the storage of a physical processor in the
same manner that the domains are mapped (see Section 4.2.2.2.7).
As a result each processor will contain as many inall-variable
sets as the number of elements assigned to that processor for each
domain. Figure 4.8 is an example of this allocation.
The purpose of the DOMAIN and INALL statements is to define an
application-oriented data structure. The structure is hier-
archical. The major divisions (the grid) are defined by the
DOMAIN statement. The subdivisions are defined by the INALL state-
ment and consist of the state variables and arrays defined to
exist at each grid point. The data structure definition is inde-
pendent of how the structure is mapped onto the storage of the
FMP.
4.2.2.4 Process Modeling
Once the structure and state of the model have been defined, some
means of describing the process to be modelled must be provided.
This description is done in two stages. First, the general flow
of the sequence of events which occur during the process would be
described. This general flow description allows the dependence of
subprocesses over time to be defined. Second, the detailed
relationships that exist within the defined structures and
substructures are defined. These relationships exist for each of
the events. The dynamic "behavior" of the discrete system is
defined by the combination of the general flow and the detailed
relationships.
Standard FORTRAN is a language with inherent sequential depen-
dencies. That is, execution is constrained to be one statement at
a time in a well-defined order. The general flow does have such a
sequential dependency. Standard FORTRAN control constructs can,
therefore, be used to describe the sequential dependency.
However, when modeling real processes, there are also concurrent
actions which must be considered. If the language provides a
means for describing the concurrency naturally inherent in the
processes being modeled, then the mapping of the abstracted process
to the hardware will be more straightforward and the user should
find it easier to define the abstraction. This concurrency in the
general flow can be described with the construct called DOALL.
4.2.2.5 DOALL Construct
The basic form of the DOALL construct was shown in an earlier
example (section 4.2.2.1). Recall that a segment of standard
FORTRAN code (which describes the computation required to evaluate
the process of getting new values from old values) is started with
a DOALL statement and ended with an ENDDO statement. Figure 4.9
is another example. This is a section of the FMP FORTRAN version
of SMOOTH (see Appendix A for a discussion of the application
code). Note the region THREED. This region is three-dimensional.
Variables SS, T1, T2, T3, T4 and a vector CT(5) are defined at
each point in the domain MODEL. The marked DOALL statement is the
control statement that indicates that all statements from that
point to the marked ENDDO statement are considered to be replicated
(one set of statements to each point in the domain), that each set
of statements (called an instance) is independent from all other
sets and that all sets of statements could execute concurrently
(given sufficient resources). The marked IF statement tests to
determine whether the particular domain point is in either the J=2
or J=JMAX-1 plane. If so, then the next few statements are
executed. If not, the statements following the ELSE are executed.
Note that these sequencing decisions within an instance are based
on data unique to the instance and do not depend on any other
instances. Also note that the statement following the ENDDO
is not permitted to begin until all instances have completed
execution.
It is interesting to note that if there are fewer processors than
instances to be evaluated, then the work would be spread out
across the processors. Each processor would evaluate more than
one instance. Since only one set of code would be required per
processor, multiple instances are executed simply by cycling
through the code for that segment. The term cycles is used to
indicate how many instances a processor has evaluated of the
currently executing doall construct. The allocation of specific
instances to processors would be static and would use the scheme
previously discussed with regard to assignment of elements of a
domain to processors (which is the same problem).
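The relationship between instances, processors, and cycles can be sketched in Python (used only for illustration). The round-robin assignment below is an assumption of this sketch; the actual assignment scheme is the one discussed for domain elements, which is not reproduced here. The function name assign_instances is hypothetical.

```python
def assign_instances(num_instances, num_processors):
    """Statically assign instance numbers 0..num_instances-1 round-robin.

    Round-robin is an illustrative assumption, not the FMP scheme.
    """
    assignment = [[] for _ in range(num_processors)]
    for instance in range(num_instances):
        assignment[instance % num_processors].append(instance)
    return assignment

assignment = assign_instances(num_instances=10, num_processors=4)

# The number of cycles each processor makes through its single copy
# of the program segment equals the number of instances it holds.
cycles = [len(instances) for instances in assignment]
print(assignment)
print(cycles)
```

Whatever the assignment rule, the assignment is fixed before execution, so each processor knows in advance how many times it must cycle through the segment.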
[Listing: SUBROUTINE SMOOTH, which declares DOMAIN /MODEL/ and
REGION /THREED/ and contains a DOALL /THREED(J,K,L)/ construct,
closed by ENDDO /THREED/, whose program segment holds the
IF ... ELSE ... END IF test on J.]
Figure 4.9 Section of FMP FORTRAN Version of SMOOTH
4.2.2.5.1 Construct Definition. The specific form of the DOALL
construct is:
doall_construct:
    doall_statement
    doall_program_segment
    enddo_statement
The doall statement is defined as follows:
doall_statement:
    DOALL  { domain_identifier | domain_specifier }  use_list
The main purpose of the DOALL construct is to identify those
processes (in the doall_program_segment) which can be concurrently
evaluated. The doall_statement identifies the points (grid
values) at which the doall_program_segment would be evaluated.
The points may be previously defined (as a domain or region) or
may be defined as part of the doall statement.
domain_identifier:
    / { domain_name | region_name }
        [ ( instance_variable {, instance_variable} ) ] /
Each instance-variable is an integer variable and is unique from
all others on the list. Each instance-variable represents one of
the dimensions of the declared domain. Instance-variables are not
required to be the same as the domain-variables used when the
domain was originally declared. For each instance the set of
values assigned to each of the instance-variables at the start of
evaluation of each instance is the set of values used to uniquely
identify that instance. The scope of the instance-variable is the
doall-program-segment. The instance variables are allocated
storage in the local processor memory.
domain_specifier:
    / domain_name / : domain_construct_expression
The domain-name could be included on the ENDDO statement which
terminates the construct (to assist in program readability). When
a domain is declared as part of the DOALL statement itself, its
scope is local to the DOALL-program-block itself. The domain-
variables used in the domain-specification-parameters become
instance-variables as described above in addition to being used to
determine the extent of each of the dimensions of the domain.
The doall-program-segment is a set of FMP FORTRAN statements which
describe the process to be evaluated at each point specified. In
terms of the model, the process defined in the doall-program-
segment is conceptually evaluated at all points simultaneously.
In addition, the process at any one point does not have access to
the values computed by the same process at any other point in the
model. Figure 4.10 shows this independent, concurrent structure
in a "flow-chart" form. The evaluation of a doall-program-segment
at a given point is called an instance. All instances of a
doall-program-segment will complete execution before executing the
next statement in sequence after the ENDDO. Although conceptually
all instances execute concurrently, the actual order of execution
is dependent on the processing resources available. The only
implementation requirements are that all instances must complete
execution prior to continuing with the next statement after the
ENDDO.
The use list specifically identifies which variables or arrays
are used within instances of the doall-program-segment. The
specific form is:
use_list:
    USING  item {, item}
  where each item is one of
    variable_name
    / common_block_id /
    inall_variable
    array_name
    / domain_name /
The purpose of this USING clause is explained in more detail later
(section 4.2.2.6).
The last statement of a doall construct is the enddo statement.
enddo_statement:
    ENDDO  [ / domain_name | region_name / ]  [ generate_list ]
generate_list:
    GIVING  item {, item}
  where each item is one of
    variable_name
    / common_block_id /
    inall_variable
    array_name
    / domain_name /
The generate list specifically identifies those variables or
arrays produced for reference upon completion of the doall
construct.
[Figure 4.10 DOALL Construct Control Flow: the DOALL statement fans
out to independent instances, which rejoin at the ENDDO.]
4.2.2.5.2 Serial FORTRAN Equivalent Form. Any DOALL construct
can be simply represented in standard FORTRAN with nested DO
loops. The depth of nesting matches the dimensionality of the
domain over which the DOALL is defined. In fact, the domain-
variable-sets (See Section 4.2.2.2.1) used to define the domain
become the control part of the DO statements.
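The serial equivalent can be sketched in Python (used only for illustration) for a two-dimensional domain. The names doall_serial and segment are hypothetical. The sketch keeps the entry values in a separate table so that every evaluation reads old values, as the DOALL semantics require.

```python
def doall_serial(extents, old_state, segment):
    """Evaluate `segment` once per domain point with nested loops.

    extents   -- (imax, jmax) of a two-dimensional domain, 1-based
    old_state -- dict mapping (i, j) to the value on entry
    segment   -- function(i, j, old_state) returning the new value
    """
    imax, jmax = extents
    new_state = dict(old_state)           # entry values stay untouched
    for i in range(1, imax + 1):          # outer DO loop
        for j in range(1, jmax + 1):      # inner DO loop
            new_state[(i, j)] = segment(i, j, old_state)
    return new_state

# A 2 x 3 domain whose value at (i, j) is 10*i + j on entry.
old = {(i, j): 10 * i + j for i in (1, 2) for j in (1, 2, 3)}
new = doall_serial((2, 3), old, lambda i, j, state: 2 * state[(i, j)])
print(new[(1, 1)], new[(2, 3)])
```

The two loop headers play the role of the domain-variable-sets; the loop body is the doall-program-segment.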
4.2.2.5.3 Nested DOALLs. The doall-program-segment may itself
include a DOALL construct. Since the application programs of
interest did not require this construct, no evaluation of possible
run-time efficiency or inefficiency has yet been made. Since
dynamic resource allocation has not been proposed yet, nested
DOALL's would be statically decomposed to serial form.
4.2.2.5.4 Mapping. The mapping of the doall-program-segments is
the same as that described for an element of a domain. However,
since each instance executes the same doall-program-segment, only
one copy of a program-segment need be kept by each processor.
4.2.2.5.5 Restrictions. No instance of a doall-program-segment
may reference the results of the current computation of any other
instance in the same doall-program-segment. Each doall-program-
segment has access to all of the values of the model at the start
of that program-segment. The entire DOALL construct must be
treated as a whole in order to control the implementation and use
of the construct. For example, consider a hypothetical system
where such a restriction did not exist and suppose that the
computations performed in one instance did use the value of
variables in another instance of the current doall-program-
segment. Under these conditions, successive runs of the program
are likely to get different results since the time order of
execution of the two program segments is not necessarily the same
from one run to the next. As a result, the variable values
fetched from the second instance are either old values or are new
values, but with no control or "knowledge" of the encompassing
program that such a variation occurred. Programs would be very
difficult to debug in such a hypothetical system.
Since no referencing between instances of the currently executing
doall-program-segment is allowed, the results of evaluation are
completely independent of the order of execution.
Because of the concurrency expressed in the DOALL construct, the
arbitrary transfers of control which are allowed in standard
FORTRAN must be restricted in FMP FORTRAN. No transfers into a
DOALL construct may be made. Within a DOALL construct, normal
FORTRAN control constructs (IF, GO TO, ...) are allowed, but
control must remain within the DOALL construct. All instances
exit via the ENDDO. No transfers out of an instance are allowed.
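The order-independence argument can be illustrated with a small Python sketch (Python is used only for illustration; the function name run_instances is hypothetical). Each instance computes the sum of its two neighbours. When every instance reads entry values, any execution order gives the same result; in the hypothetical system without the restriction, the result depends on the order.

```python
def run_instances(old, order, in_place):
    """Apply new[i] = old[i-1] + old[i+1] at the interior points in `order`.

    in_place=False: every instance reads the entry values (the rule above).
    in_place=True : instances read whatever is currently stored, so an
                    instance may see another instance's result.
    """
    state = list(old)
    source = state if in_place else old
    for i in order:
        state[i] = source[i - 1] + source[i + 1]
    return state

old = [1, 2, 3, 4, 5]
forward = [1, 2, 3]
backward = [3, 2, 1]

print(run_instances(old, forward, in_place=False))
print(run_instances(old, backward, in_place=False))
print(run_instances(old, forward, in_place=True))
print(run_instances(old, backward, in_place=True))
```

The first two calls agree; the last two differ from each other, which is the debugging hazard the restriction removes.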
4.2.2.6 Variable Referencing
The standard FORTRAN referencing conventions apply. One extension
has been defined to simplify the description of the models of
interest. This extension supports the concept of "centered-
subscripts".
4.2.2.6.1 Referencing Within a DOALL-program-segment. The DOALL
construct described above is used to define the time sequencing of
a modeling process. At each discrete time step, some sort of
interaction between the various parts of a model takes place. In
particular, the modelling task may involve the use of general state
variables, of state variables unique to each element of a domain
and intermediate variables used during the evaluation of a
process. General state variables are used to describe the overall
process or structure and may be referenced with no [illegible].
The state variables defined at each point of a grid may be
[illegible] other points. However, the
intermediate values used during the evaluation of a process would
be of concern only to each instance and not to any others. In
order to have an orderly flow of data and allocation of storage in
the system, several language constructs have been proposed which
relate to the above dependencies.
The general state variables (those variables which apply to all
points of a domain or region) will be called GLOBAL variables.
Those state variables which have been defined for each of the
points of a domain will be called STRUCTURE variables. Any
variables with values generated and used only within an instance
will be called LOCAL variables. Note that GLOBAL variables would
not be defined using INALL statements.
The differentiation of these different "classes of use" is
important in a multiprocessor such as the FMP because of the added
complexities of storage allocation and storage management. The
additional constructs already defined provide application-oriented
ways to define variable usage to the compiler.
The USING and GIVING clauses of the doall and enddo statements are
a natural way to explicitly define the data-dependency of the
system modeled at the process-level. The compiler, in turn, will use
these statements, together with analysis of the source code, to
produce code to initiate early requests of data transfers from EM
to the processors and back, thus further improving throughput by
allowing more overlap of execution with fetching data from the
Extended Memory.
Any variable used within a doall-program-segment but not declared
in the USING clause must be self-defining within each instance
prior to its use. If a variable is not included in a USING or
GIVING clause the implication is that the variable is only needed
temporarily during the evaluation of the process. Thus, in order
to consider storage requirements, variables not declared either
USING or GIVING need exist only for each "active" instance, rather
than for each instance. (An "active" instance is an instance
which is being executed by a processor resource.)
If a variable is included on a USING or GIVING clause, but is not
included in an INALL declaration, the implication is that all
instances of the doall-program-segment will reference the same
variable (GLOBAL variable). When this condition occurs, the
compiler would allocate space for such a variable in each processor
and generate code which would cause the value of such a variable
to be broadcast to all processors rather than requiring each
instance to separately request access to that variable. Variables
of this sort were previously called "CONTROL" [[,2].
If a variable is included on both a USING or GIVING clause and on
an INALL declaration, the implication is that each instance will
require its corresponding INALL variable (recall that INALL
creates a variable for each point in the domain). Special sub-
script forms defined in the next section can be used to reference
INALL variables in other instances.
Figure 4.11 summarizes the variable use interpretations based on
the statements defined.
The importance of the independence of the instances of a doall
program segment has already been pointed out. All GLOBAL and
STRUCTURE variables as well as all instance identifiers (used to
identify the set of indices which define the point in the domain)
can be considered to be preinitialized to their value upon entry
to the DOALL construct. At that point, at least conceptually, the
evaluation rules within a particular instance are interpreted in
classical FORTRAN fashion except that the values assigned to
GLOBAL or STRUCTURE variables can be referenced only by the
instance which did the assignment. All other instances would
continue to reference the original values. Similarly, a set of
instance identifier variables would exist for each instance. The
initial values in the set for a particular instance would identify
the instance. Any changes made to the values in one set could not
be observed within any other instance. The FMP (hardware and
software) will enforce these referencing procedures.
                          in any USING or GIVING clause
                               YES             NO
declared INALL     YES      STRUCTURE        LOCAL
on DOMAIN          NO       GLOBAL           LOCAL

Figure 4.11 Variable Use Interpretation
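The interpretation in Figure 4.11 amounts to a two-input decision, sketched here in Python (used only for illustration; the function name variable_class is hypothetical and is not part of FMP FORTRAN).

```python
def variable_class(declared_inall, in_using_or_giving):
    """Return the use class given by the table in Figure 4.11."""
    if not in_using_or_giving:
        # Needed only while an instance is active, INALL or not.
        return "LOCAL"
    return "STRUCTURE" if declared_inall else "GLOBAL"

for inall in (True, False):
    for clause in (True, False):
        print(inall, clause, variable_class(inall, clause))
```

Note that appearing in a USING or GIVING clause is tested first: a variable absent from both clauses is LOCAL whether or not it was declared INALL.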
4.2.2.6.2 Centered-Subscripts. The intent of the new constructs
described (DOMAIN, INALL, DOALL, ...) has been to allow the
description of a model and the modeling process as it reflects the
process and state at each discrete point of the structures of
interest. References by the doall-program-segment to variables in
the same element of the DOMAIN as the instance need only be by the
simple name. For example,
REAL INALL /ATMOS/ T,WNDVEL, AB(7)
is a statement declaring variables T and WNDVEL and a vector AB at
each element of the domain ATMOS. In the following program
segment, the process defined computes new values which are a
function only of old values in the same instance:
DOALL /ATMOS / USING WNDVEL, AB
T = (AB(1) + AB(2))/2
WNDVEL = (WNDVEL + AB(3) + AB(5))/2*AB(4)
ENDDO /ATMOS/GIVING T, WNDVEL
Many models have dependencies between elements of the structure.
When describing processes of this type, a natural approach is to
describe the process centered at a particular element and consider
the rest of the structure with respect to the centered element. When
referencing INALL variables in other instances, a subscript is used
in a manner similar to normal array and vector referencing in
standard FORTRAN.
Another example might be:
DOALL /ATMOS (I,J,K)/ USING T
T = T(I,J,1)
ENDDO /ATMOS/ GIVING T
Here the new value of T at each element of the structure is made
equal to the original value of T on the lower plane of the
structure. (i.e. All elements of a column of /ATMOS/ have the
same value of T as the value of T in the first element of the
column.) Note that I and J are constants throughout the doall-
program-segment since they are the instance-variables. The "*"
may be used to indicate the value of the instance-variable
corresponding to the element of the domain. For example, another
way of writing the preceding example is:
DOALL /ATMOS / USING T
T = T(*,*,1)
ENDDO/ATMOS/GIVING T
When variables in adjacent elements of a domain are to be refer-
enced, subscript expressions may be used. For example:
REGION /CENTRAL (L=1,IMAX-2; M=1,JMAX-2; N=1,KMAX-2)/
= /ATMOS(L+1,M+1,N+1)/
DOALL /CENTRAL(I,J,K)/ USING T
T = (T + T(I+1,*,*) + T(I-1,*,*) + T(*,J+1,*) + T(*,J-1,*)
1 + T(*,*,K+1) + T(*,*,K-1))/7
ENDDO /CENTRAL/ GIVING T
In this example a REGION was declared that excluded the outer
boundary of ATMOS. The DOALL computed new values of T based on
all immediately adjacent values. Note that variables declared
INALL over a DOMAIN are also accessible to any REGION of the
DOMAIN just as if they had been declared INALL on the REGION.
Also note that values of T at adjacent elements of the DOMAIN are
used to compute new values of T at each element of the REGION.
All computation is based on the values of T throughout the DOMAIN
upon entry to the DOALL construct.
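The CENTRAL example can be mirrored by a short Python sketch (Python is used only for illustration; the function name smooth_central and the dictionary representation of T are assumptions of this sketch). Interior points of a 1-based grid are replaced by the seven-point average of entry values, and boundary points are left unchanged.

```python
from itertools import product

def smooth_central(t_old, imax, jmax, kmax):
    """Seven-point average over the interior of a 1-based 3-D grid.

    t_old maps (i, j, k) to T on entry; boundary points are copied
    unchanged, interior points are averaged from entry values only.
    """
    t_new = dict(t_old)
    for i, j, k in product(range(2, imax), range(2, jmax), range(2, kmax)):
        t_new[(i, j, k)] = (
            t_old[(i, j, k)]
            + t_old[(i + 1, j, k)] + t_old[(i - 1, j, k)]
            + t_old[(i, j + 1, k)] + t_old[(i, j - 1, k)]
            + t_old[(i, j, k + 1)] + t_old[(i, j, k - 1)]
        ) / 7
    return t_new

# A 3 x 3 x 3 grid that is zero except for the single interior point.
imax = jmax = kmax = 3
t_old = {p: 0.0 for p in product(range(1, 4), repeat=3)}
t_old[(2, 2, 2)] = 7.0
t_new = smooth_central(t_old, imax, jmax, kmax)
print(t_new[(2, 2, 2)], t_new[(1, 1, 1)])
```

Reading only from t_old corresponds to computing from the values of T "upon entry to the DOALL construct"; the loop bounds 2 through IMAX-1 correspond to the region that excludes the outer boundary.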
As a last example, note that a doall-program-segment only treats
values of INALL variables as having initial value upon entry if
the GIVING or USING clause specified those variables. During
execution of a program-segment, the variables may locally be
assigned other values. Only the centered-variables are saved
under the GIVING clause.
REAL INALL/ATMOS/ T, WNDVEL
DOALL/CENTRAL(I,J,K)/USING T
TOLD = T
T = (T + T(I+1,*,*) + T(I-1,*,*))/3
XY(1) = T
XY(2) = (TOLD + T(*,J+1,*) + T(*,J-1,*))/3
XY(3) = (TOLD + T(*,*,K+1) + T(*,*,K-1))/3
T = (XY(1) + XY(2) + XY(3))/3
ENDDO /CENTRAL/ GIVING T
In this example only T(*,*,*) is saved upon completion of all
instances. The array XY and the variable TOLD are LOCAL
variables. These variables are used only by the active instance. In
order to conserve storage, the same processor memory locations
used for LOCAL variables during execution of an instance in a
processor can be used for the LOCAL variables of another instance
when more than one instance of a doall_program_segment is
evaluated in any given processor. Note that the original value of
T(*,*,*) had to be saved since the second statement changed the
value (as far as the particular instance was concerned). In this
way execution of a doall-segment is the same as that of any
FORTRAN segment with the INALL variables specified in GIVING
clauses initialized as if with a DATA statement upon each entry.
4.2.2.6.3 Unreferenced Variables. In some cases, a variable may be
identified within a separately compiled segment, but never be used
within that segment. This happens, for example, if the main
program has a named common area that is used in a number of
subroutines, and the area must exist in the main program for the
purpose of holding data created by one of the subroutines and used
by the other. In this case, the compiler will not have access to
the USING and GIVING declarations, because of the separate
compilation. Until a better way of handling this situation is
defined, the declaration of these named common areas will be
expanded by prefixing them with an indication of how they will be
used, when they are used.
STRUCTURE declares that the variables and arrays within the
given common block will be used as if they had been included
in INALL statements and USING or GIVING clauses.
GLOBAL declares that the variables and arrays in the given
common block will be used as if they had been included in
USING or GIVING clauses.
All variables and arrays in a given named common must be used in
the same way (i.e. as STRUCTURE or GLOBAL).
4.2.2.7 Storage Allocation
The Flow Model Processor has two major areas of storage to be
concerned with during execution, the main memories of the
processors and the extended memory. The primary use of extended
memory is for the STRUCTURE data (the "old" state values).
Processor memory is allocated to program, and to data storage
space. The data storage space is further divided into temporary
areas used only while an instance is active (the LOCAL variables)
and into areas which are allocated to each instance resident in
the processor. The data areas allocated to instances hold the NEW
values as well as copies of OLD values. Note that although many
instances of a process may be assigned to a particular processor,
only the data areas reflect that assignment. Only one copy of the
program-segment would exist.
The GLOBAL variables normally have storage space allocated both in
the processor memories and in the EM. This allocation is a space-
time tradeoff. If only the original copy existed in the EM, then
each instance would have to access it separately with potential
conflicts (when more than one processor tries to access the same EM
location simultaneously, only one is granted access; any others
wait). If the value is broadcast to all processors simultaneously
(say at the start of a doall), then any references would be to the
local copy already resident in each processor.
4.2.2.8 Independent Compilation
Program units, as with any conventional FORTRAN, may be separately
compiled. Note that there may need to be a distinction between
two classes of subroutines. One class would be those called
within a doall program_segment. These subroutines would be com-
pletely independent of any coordinator code and would have any
embedded DOALL constructs implemented as nested DO loops. The
other class of subroutine would be those called outside a doall
program segment. If a subroutine of this class did have an
embedded DOALL construct, both coordinator and processor code would
have to be generated in order to take advantage of the available
processors.
One solution to the above situation is to somehow identify one
class of subroutine from the other. This could easily be done
with a simple construct added to the SUBROUTINE statement itself.
For example
SUBROUTINE BTRI DOMAIN/J=1,JM; I=1,IM/
This would indicate that BTRI would be called within a two-
dimensional doall and that IM*JM copies of the subroutine should
be available to the instances of the doalls.
Other solutions exist. They include independent compilation for
checking purposes but full compilation to generate code files.
Another solution would be to provide information concerning
location of doall constructs to the linker and have the linker
include coordinator code where needed. All of the above solutions
are still under consideration to determine the most effective
solution.
4.2.2.9 Code Generation
The compiler will produce code for both the coordinator and all
processors. A very straightforward division of control would
exist. That code required to synchronize DOALLs and to support
interaction with the external environment would be resident on the
coordinator. All other code would be allocated to the processors.
The DOALL constructs just described are simple examples of this.
The processors would each contain a copy of the doall-program-
segment together with some identification of that segment. When
the flow of control of the program arrives at the DOALL, the
coordinator would broadcast a "start segment n" command. When all
instances have completed and all processors have notified the
coordinator with "I got here", the coordinator would synchronize
the updating of OLD values in the EM followed by initiating the
next program-segments in the processors.
Program segments which are not part of DOALL constructs but which
are standard serial FORTRAN could be analyzed by the compiler to
determine any data dependencies. Separate, data independent code
sequences would be defined with the appropriate conditional tests
so that each processor would evaluate one section of the resulting
program segment. The controls in the coordinator would be the
same as for the DOALL case (in effect, a "DOALL" would have been
constructed out of the original code). Since the processors can
all operate autonomously, this approach should result in
additional speed-up on serial codes. A speed-up on the order of 2-5
over straight serial execution is likely from this approach. The
application analysis summarized in Chapter 3 DID NOT assume this
speed-up of sections of serial code. Note that a separate high-
speed "scalar" processor is not required. Each processor is
independently capable of scalar execution, so that concurrency is
not dependent upon vectorization, as it is in today's vector
machines.
4.2.2.10 Functions
Functions on the FMP will include not only the normal mathematical
intrinsics, such as ATAN, LOG, EXP, and SQRT, but also a family of
functions that are brought about because of the parallel nature of
the FMP. The global intrinsics, which reflect the parallel
structure of the system, are described in more detail below.
Table 4.1 lists the functions which could be provided in FMP
FORTRAN. In addition to listing the function, the table also
lists the expected implementation (such as operator, in-line
expansion, or calls on external function subprograms). Some of the
functions will combine in-line code with external calls and are
marked for both.
4.2.2.10.1 Global Functions. The global functions have no analog
in a serial machine and are not normally used in the direct
description of a model. These functions are useful in the
simulation controls and in the summary and analysis of the results
of a simulation.
The global functions operate across the declared parallelism
defined in the model structures. For example, the following
serial FORTRAN
A = 0.0
DO 1 J=1,100
A = A+B(J)
1 CONTINUE
would be replaced by
DOALL J=1,100 USING B
A = SUMALL (B(J))
ENDDO GIVING A
Note that this is implemented in two levels. First, each processor
sums the instances assigned to it to generate a partial
sum. At the end of the DOALL construct, the coordinator generates
log2 P (P = number of processors) operation sequences to associate pairs
of partial sums to get a set of higher-order partial sums. These
sums are then paired and summed successively until one value
remains.
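The two-level scheme can be sketched as follows. This is a Python model, not FMP code; the function name sumall, the strided assignment of instances to processors, and the zero padding for processor counts that are not a power of two are assumptions made for the illustration.

```python
def sumall(values, n_processors):
    """Model of a two-level global sum; returns (total, pairing steps)."""
    # Level 1: each processor sums the instances assigned to it
    # (strided assignment is an assumption of this sketch).
    partial = [sum(values[p::n_processors]) for p in range(n_processors)]

    # Level 2: partial sums are paired and added until one value remains.
    steps = 0
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)   # pad an odd-length level
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
        steps += 1
    return partial[0], steps
```

With 512 modelled processors the pairing loop runs nine times, matching log2 512. Because the additions are grouped differently from the serial loop, floating-point results may differ from the serial sum in the last bits.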
Table 4.1 Intrinsic Functions
[Table body not recoverable from the scanned source; the notes to the table follow.]
Notes for Table:
Note 1. The value returned by INT is that integer of the same
sign as a with a magnitude not larger than |a|. If a is too
large, integer overflow is reported.
Note 2. The representation of these functions in the FORTRAN
source will use the standard double-asterisk notation for exponen-
tiation. The function called will depend on the data types of A
and B in A**B. The external function called is an alternate entry
into the EXP function subprogram.
Note 3. In FMP format, FIRST and SNGL are different names for the
same function.
Note 4. The values of i mark the elements of a domain. These,
and the following functions occur across all those instances of a
DOALL that execute the statement containing the function. Thus,
with 2 arguments showing in the code, there will be 2 x imax actual
arguments, where imax is the size of the domain.
Note 5. LOCATION finds the instance number of the instance in which
the previous MAXALL, MINALL found its maximum or minimum. LOCATION with a
logical argument finds the instance number of one of the instances
in which that variable is true.
Note 6. RECURRENCE is discussed below.
The result of a global function is not available for use within
the DOALL in which it is called. Since the various instances of a
doall-program-segment may be executed in arbitrary order, any
given instance may complete before some other instance has
provided its input to the global function. Thus, the output of
the function is not defined until the execution of the last
instance. The results of the global function are available after
control passes the ENDDO.
4.2.2.10.2 LOCATION. The LOCATION function operates with the
assumption that the value returned is the instance number of the
successful instance of the most recent execution of MAXALL,
MINALL, ... The subsequent use of this instance number as a
subscript depends on the implicit equivalence between a one-
dimensional subscript and a unique multiple-subscript.
For example, given a structure variable A declared INALL over a
domain, the largest element of the array could be determined as
follows:
DOMAIN /LAYER/: I=1,10000
REAL INALL /LAYER/ A
DOALL /LAYER/ USING A
IPTR = LOCATION (GLOBALMAX(A))
ENDDO /LAYER/ GIVING IPTR
PRINT A(IPTR)
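A small Python model may help show what LOCATION returns and how an instance number relates to a multiple subscript. The function names are hypothetical, instance numbers are taken to be 1-based, ties are resolved in favour of the lowest instance number, and the first subscript is assumed to vary fastest (FORTRAN storage order); none of these details is stated in the report.

```python
def maxall_with_location(values):
    """Return (maximum, 1-based instance number where it occurs)."""
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best + 1


def to_subscripts(instance, dims):
    """Map a 1-based instance number to 1-based subscripts.

    Assumes the first subscript varies fastest, as in FORTRAN storage.
    """
    k = instance - 1
    subs = []
    for d in dims:
        subs.append(k % d + 1)
        k //= d
    return tuple(subs)
```

For a one-dimensional domain such as /LAYER/ the instance number can be used directly as the subscript; to_subscripts only matters when the domain has more than one variable.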
4.2.2.10.3 RECURRENCE. The RECURRENCE function is only defined
over domains active in one dimension. The RECURRENCE function
would be invoked as shown in the example below:
A(J+1) = RECURRENCE (A(J)*B+C(J))
where A is declared INALL across the DOMAIN.
The prototype compiler should implement only the simple form of
recurrence, with B constant. The additive term need not be
subscripted and may be missing. The constant B may be omitted
when it is equal to i.
RECURRENCE, the global operation, is the formation of a parallel
linear recurrence in nine (= log2 512) steps as demonstrated by
Shyh-Ching Chen in his doctoral thesis at the U. of Illinois [13].
In FORTRAN, consider
DO 1 J=1,512
A(J+1) = A(J)*B + C(J)
1 CONTINUE
This program segment takes 512 steps, each with one multiply and
one add. The parallel algorithm in RECURRENCE produces the same
result in nine steps.
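One way to see how a linear recurrence can be formed in log2 N steps is the standard recursive-doubling formulation sketched below in Python. This is an illustration under assumptions, not necessarily the algorithm of [13] or the one the FMP would use: each element holds an affine map x -> m*x + k covering a window of consecutive recurrence steps, and every pass doubles the window length. The function names are hypothetical.

```python
def serial_recurrence(a0, b, c):
    """Reference: A(J+1) = A(J)*B + C(J), one step at a time."""
    a = [a0]
    for cj in c:
        a.append(a[-1] * b + cj)
    return a


def doubling_recurrence(a0, b, c):
    """Same result by recursive doubling; returns (values, passes)."""
    n = len(c)
    m = [b] * n     # multiplier of the window ending at step j
    k = list(c)     # additive term of the window ending at step j
    stride, passes = 1, 0
    while stride < n:
        new_m, new_k = m[:], k[:]
        for j in range(stride, n):   # conceptually all j in parallel
            # Apply the window ending at j-stride, then the one ending at j.
            new_m[j] = m[j] * m[j - stride]
            new_k[j] = m[j] * k[j - stride] + k[j]
        m, k = new_m, new_k
        stride *= 2
        passes += 1
    return [a0] + [m[j] * a0 + k[j] for j in range(n)], passes
```

For 512 terms the while loop makes nine passes (strides 1, 2, 4, ..., 256). Each pass does more arithmetic than one serial step, so the gain comes only from executing the inner loop across processors.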
Although global sums, global products, and the parallel linear
recurrence are functions in the language, they are not always the
optimum programming method for producing these particular results.
For example, take the serial FORTRAN below.
DO 1 J=1,1000
DO 1 K=1,1000
A(J,K+1) = A(J,K) * B(J) + C(J,K)
1 CONTINUE
There are several ways to write this in FMP FORTRAN given that the
order of nesting the loops is otherwise irrelevant. Two of them
are:
Method I:
DOALL J=1,1000
DO 1 K=1,1000
A(J,K+1) = A(J,K) * B(J) + C(J,K)
1 CONTINUE
ENDDO
Method II:
DOALL K=1,1000
DO 1 J=1,1000
A(J,K+1) = RECURRENCE (A(J,K) * B(J) + C(J,K))
1 CONTINUE
ENDDO
Method I, which executes the recurrence serially in an inner loop,
runs over nine times as fast as method II, which executes each one
of the recurrences in parallel across each value of J in turn.
That is, method I is 512 times as fast as a single processor,
while method II is 57 times faster than a single processor. The
global functions are included for those cases where method I is
not an available option.
4.2.2.10.4 Efficiency of GLOBAL Functions. The global functions
are logarithmic in efficiency for domains up to 512 in size. That
is, it takes nine steps to produce the 512-way result across all
512 processors. For larger domains, the global function is
executed serially with respect to all those instances executed on
each processor (called CYCLES). As a result, the number of steps
required for SUMALL, for example, is N/512 + 8 where the domain
has N elements.
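The step count can be worked through in a few lines of Python. The function below is a hypothetical helper, not part of FMP FORTRAN; it assumes that a domain size which is not a multiple of the processor count is rounded up to whole cycles, and it generalizes the "+ 8" to log2 P - 1 (the first cycle needs no addition, and the pairing phase takes log2 P steps).

```python
import math


def sumall_steps(n, p=512):
    """Estimated steps for SUMALL over n instances on p processors."""
    cycles = math.ceil(n / p)             # instances handled per processor
    return cycles + int(math.log2(p)) - 1


# 512 instances: 1 cycle + 8 = 9 steps, the purely logarithmic case.
# 51,200 instances: 100 cycles + 8 = 108 steps.
```

For large N the N/512 term dominates, so the global function approaches the full 512-way speed-up even though the pairing phase itself is only logarithmic.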
4.2.2.10.5 Direct Calls on Global Functions. An alternative
construct for global functions is:
global-function-name /domain-name/ (argument-list)
For example:
S = SUMALL/DD(J)/(A(J))
is equivalent to
DOALL /DD(J)/ USING A
S = SUMALL(A(J))
ENDDO /DD/ GIVING S
This form is the equivalent of single-statement DOALLs when the
statement is a global function.
Boolean global functions may be used directly in IF statements
once evaluated. For example:
IF (ANY /DD(J)/ (A(J))) ...
is equivalent to
DOALL /DD(J)/ USING A
DUMMY = ANY (A(J))
ENDDO /DD/ GIVING DUMMY
IF (DUMMY) ...
When LOCATION directly follows such an implied single-statement
DOALL, the compiler combines it into the DOALL of the previous
global function.
For example
MM = MAXALL/DD/(A(J))
IX = LOCATION
is equivalent to
DOALL /DD/ USING A
MM = MAXALL (A(J))
ENDDO /DD/ GIVING MM
IX = LOCATION
4.2.2.11 Assignment Statements
The following pertains to execution within each instance of a
doall-program-segment. Recall from section 4.2.2.6.1 (and Figure
4.11) that three classes of variables exist in doall-program-
segments. They are called GLOBAL, STRUCTURE, and LOCAL.
All STRUCTURE variables have their old values when any instance
begins execution. Assignment to any structure variable from
within an instance will result in the new value being available
only within that instance. Other instances would still refer to
the old value unless they too had executed a similar assignment
statement. Once all instances are complete, the STRUCTURE variables
are all updated with the new values computed within the
instances.
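The old-value and new-value rule for STRUCTURE variables can be modelled in Python as below. The class InstanceView and the functions doall and smooth are hypothetical names for this sketch, and it assumes that each instance assigns only to its own element, since the report does not say what happens when two instances assign to the same element.

```python
class InstanceView:
    """What one instance sees of a STRUCTURE variable (illustrative)."""

    def __init__(self, old):
        self._old = old
        self.writes = {}

    def __getitem__(self, k):
        # An instance sees its own new value, otherwise the OLD value.
        return self.writes.get(k, self._old[k])

    def __setitem__(self, k, v):
        self.writes[k] = v   # private until the DOALL completes


def doall(domain, body, a):
    old = list(a)            # every instance starts from the OLD values
    merged = {}
    for i in domain:         # any execution order gives the same result
        view = InstanceView(old)
        body(i, view)
        merged.update(view.writes)
    for k, v in merged.items():   # update after all instances complete
        a[k] = v
    return a


def smooth(i, a):
    a[i] = a[i - 1] + a[i + 1]   # neighbours are read as OLD values
    a[i] = a[i] * 10             # the instance sees its own new value
```

Running doall over the interior points of [1, 2, 3, 4, 5] in forward or reverse order gives the same array, which is the property that lets instances execute in arbitrary order.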
Assignment to a GLOBAL variable or array element will redefine the
value of that variable within the instance in which the assignment
is made. However, the original value of the variable remains
available for reference by any other instance. Since a GLOBAL
variable must have only one value, a doall-program-segment may
assign new values to GLOBAL variables only through a GLOBAL
function which maps a set of STRUCTURE variable values onto a
single value. Such a new value is available only after the ENDDO
statement. All other apparent assignments to a GLOBAL variable
within the DOALL define the GLOBAL variable to the end of the
instance.
Assignment to LOCAL variables may take place at any point during
execution of an instance. Operation is as with standard FORTRAN
except that upon completion of the instance, the storage space
allocated to such LOCAL variables would be reassigned upon exit
from the doall-program-segment. A compiler option will exist such
that LOCAL variables would be assigned unique storage locations
for each instance. In this case, LOCAL variable values would
carry over from one reference to another, even between different
DOALL constructs.
External to a DOALL, all references and assignments to GLOBAL and
STRUCTURE variables are valid. In such a case STRUCTURE variables
must be subscripted.
4.2.2.12 Miscellaneous Features
4.2.2.12.1 Same-line Comments. A reserved character, not in the
FORTRAN character set, will be defined that may be used to
terminate a statement. Thus, anything following on the same
physical card is comment. A likely character is "%".
When the syntax of a statement is such that the only allowable
characters on the rest of the card are blanks, the compiler will
not check. Thus, statements like ENDIF, IF (-boolean-) THEN allow
comments to be placed on the rest of the card.
4.2.2.12.2 Recursion. Recursive calls are allowed. Note that
although the second, nested call on the subroutine gets a second
set of subroutine-local variables and arrays (separate from the
set belonging to the outer call) any named common that belongs to
the subroutine will be the same named common area in both calls.
4.2.2.12.3 DO LOOPS. Since a domain consists of a finite ordered
set, the control of a DO loop can be specified with a set of
domain elements. For example:
DOMAIN /LAT/: I=1,IM
DO 1 /LAT/
is equivalent to
DO 1 I=1,IM
If the domain is multidimensional, the order of nesting of the
"implied" DO loops is FORTRAN subscript order. That is, the first
variable is the index of the inner loop. The last variable is the
index of the outer loop.
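The nesting rule for a multidimensional domain can be illustrated with a short Python generator. The name domain_do_order is hypothetical; the sketch only shows the visiting order implied by the rule that the first domain variable is the innermost loop.

```python
from itertools import product


def domain_do_order(*ranges):
    """Yield index tuples with the first domain variable varying fastest."""
    # itertools.product varies its last argument fastest, so the ranges
    # are reversed going in and each tuple is reversed coming out.
    for combo in product(*reversed(ranges)):
        yield tuple(reversed(combo))
```

For a domain with I=1,2 and J=1,3 this yields (1,1), (2,1), (1,2), (2,2), (1,3), (2,3): I is the inner loop and J the outer loop.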
4.2.2.12.4 EXIT Statement. The EXIT statement can be used to
terminate an individual instance of a DOALL construct. In addition,
a DO loop may be terminated with an EXIT statement. EXIT
statements are permitted wherever executable statements are
allowed.
4.2.2.12.5 Dynamic Array Sizes. Space is not allocated for a
named common until the first program unit using that named common
is entered. Likewise, space is not allocated for variables and
arrays of a program unit until that program unit is entered.
Hence, sizes of common blocks and dimensions of arrays can be
set dynamically during program execution. The only requirement
is that the expression determining the size be evaluated at
the point in the program where the declaration occurs. In the
case of arrays in named common areas, the size-determining
expressions must evaluate to the same value in every program unit,
or a run-time error is likely.
4.2.2.13 Input Output
All FMP input and output is staged via the Data Base Memory.
Since I/O is inherently serial, a mapping of concurrent execution
to the serial form supported by peripherals is required. That I/O
specified within the serial parts of FMP FORTRAN programs occurs
as specified. That I/O specified within a DOALL over a DOMAIN or
a REGION is processed as requested over time. Since the instances
of a DOALL are independent, no attempt to order I/O of one
instance with respect to another is made although the time
sequence of I/O within any one instance is maintained.
Formatted I/O is expected to be supported primarily by the Support
Processor since the major formatting load is on output. In addi-
tion, the applications studied were such that input formatting
could be accomplished prior to initiation of an FMP task. As a
result all FMP I/O would be direct I/O via the DBM. These assump-
tions have affected the instruction set choices in the processors
of the FMP. No powerful character handling instructions exist at
this time. Due to the heavy loading of output formatting to
support the COM load (in excess of 10,000 frames of graphic
information per day), continued consideration is being given to moving formatting
support onto the FMP. The system as evaluated (with Support
Processor formatting functions) could certainly support the
expected workload. The remaining questions pertain to whether a
more cost-effective solution might exist.
4.2.3 FMP FORTRAN Compiler
As with any large design problem, a compiler development project
involves a number of stages including some means of testing design
ideas. The compiler discussed below is actually envisioned to be
a succession of compilers beginning with what would best be
described as a "prototype FMP FORTRAN compiler". Where appro-
priate, these discussions will point out features or capabilities
planned for the prototype compiler or planned to be deferred to
later versions.
The FMP FORTRAN compiler would execute on the Support Processing
System. Source input, generated code and other output would
reside in the NASF File System.
4.2.3.1 Functional Objectives
The functional objectives of the compiler are:
4.2.3.1.1. Support to the User. Not only should compile-time
messages be clear, but run-time aids should be provided for
debugging, for gathering statistics and for monitoring the dynamic
execution of a program. Other facilities should include gener-
ation of optional memory, array and extent bounds checks.
4.2.3.1.2 Support of the Language. The defined language (FMP
FORTRAN) would be the language supported by the compiler. No
changes to the language or compiler would be made without
consideration of the other.
4.2.3.1.3 Make Efficient Use of FMP Resources. The compiler may
never be capable of implementing the "most efficient" use of FMP
Resources. This inability is due, in part, to the data-dependencies
which are run-time sensitive and, in part, to the complexities
of global optimization.
The prototype FMP FORTRAN compiler is expected to implement
limited optimization. The level of optimization at the prototype
stage would be "peephole" optimization giving improved overlap of
the FMP functional elements during execution. For example,
register allocation could be adjusted so that the store to memory
ending one statement would follow the first fetch or two belonging
to the succeeding statement. Requests for data from EM (LOADEM'S)
would be positioned near the start of a program-segment. This
position should make it possible for CN delays to occur concurrent
with processing. Where possible, integer and floating point
instructions would be rearranged to improve overlap. Optimization
of this sort requires local, straightforward data flow
analysis, probably using the register addresses as data
identifiers.
Since static allocation of the defined processes onto the memory
and processor resources is planned, the resources might not be
used as efficiently as in a dynamic, "load-leveling" run-time
allocation scheme. Unfortunately no efficient, yet simple,
dynamic scheme has been studied as yet. As experience is gained,
static optimization will occur in two major areas: data or
storage allocation and processor allocation. For example, as data-
dependency analysis improves, code can be generated which main-
tains STRUCTURE variables always within a processor if all
instances which refer to those variables are also within the same
processor. Data-dependency analysis would also likely be used to
assign instances of DOALLS to processors on the basis of least
communication with Extended Memory.
Another means toward meeting the goal of efficient use of FMP
resources is to generate efficient object code. Some of this
efficiency will be derived from classical compilation techniques
(feasible since most of the task involves generating code for
individual processors). Some of the efficiency will come because
of the simplicity of having only one program in execution at a
time.
4.2.3.1.4 Support the Operational Environment. The FMP Compiler
would be able to provide the necessary linkages to the logical
input-output subsystem. In addition the compiler would produce
the necessary information for the linkage editor.
Since the proposed FMP organization is very modular and is likely
to be implemented first with a limited number of processors, an
option which must be available with the prototype compiler is to
compile for "N" processors and "M" memories. This capability
should add considerably to the time available for debug and system
integration of the software since not all 512 processors need be
available to begin system testing.
4.2.3.2 Functional Organization
Figure 4.12 shows the expected functional organization of the FMP
Compiler. The internal interface between all components shown
would be a common representation of the compiled program. Such a
common representation should allow the development of compiler
design and debugging aids. For example, the source generator
module could be used at any phase of compiler execution to
generate a record of the current state of compilation.
4.2.3.3 Domains
The prototype compiler would handle only rectangular domains. In
addition, the domains would be constrained to a maximum of four
domain variables with constant spacing. These restrictions are
suggested to reduce the prototype compiler complexity. The hardware
proposed would tolerate any kind of index set as a domain.
Language features have yet to be proposed for describing such non-
rectangular domains.
4.2.3.4 Data Flow Analysis
Data-flow analysis is not required to produce executable code so
the prototype compiler is not expected to have such an analysis
capability. However, the compiler can do a much better job of
optimizing when a data-flow analysis is included. One of the
chief uses of data flow analysis would be to improve memory
allocation decisions. For example, if more structure variables
can be held in processor memory, the number of EM fetches and
stores would be reduced with a likely improvement in throughput.
4.2.3.5 Memory Allocation
Memory allocation is static in the sense that only one program
occupies the FMP at any given time, and that the same variable in
that program always occupies the same memory address if the same
run is repeated. Allocation is dynamic in the sense that space is
allocated to named common areas only when the first program unit
using them is entered; space is allocated to variables local to a
program unit only when that program unit is entered; and these
spaces are deallocated when the last program unit using these
local variables is exited. Hence, the same physical memory area
may successively be allocated to local variables in a number of
program areas. As mentioned earlier, an option would be available
such that no deallocation of unused memory space would occur.
This option would be useful if data values are carried from one
call of a subroutine to the next.
Program and data areas have no relationship to each other. They
would be separately managed. In fact, separate calls on the same
subroutine from different places may have the local working space
of the subroutine allocated to different places in memory, and the
code file for that subroutine will not have moved.
[Figure 4.12 is a block diagram. Inputs and outputs: SOURCE FILE, SOURCE LISTINGS, SYMBOL TABLES, ANSI STANDARD FORTRAN, FMP CODE. Compiler components: PARSER, DATA RELATIONSHIP ANALYZER, TASK FLOW ANALYZER, INTERMEDIATE CODE GENERATOR, OPTIMIZER (DATA), OPTIMIZER (CODE), SOURCE GENERATOR, CODE GENERATOR.]
Figure 4.12 Functional Organization of FMP FORTRAN Compiler
4.2.3.6 Subroutine Entry and Return
The subroutine entry and return mechanism would be essentially
that of standard Burroughs machines. This mechanism allows the
deallocation of unused memory space rather than requiring the
space of all subprograms to occupy physical memory addresses even
during the time it is not being used. One of the integer
registers would be used as a stack pointer. It points to a
"return control word" (RCW) which contains: a) the memory address
of the RCW of the procedure calling this one, b) the program
counter setting to which return should be made, and c) the size of
the memory area required by this program. Upon subroutine entry,
the size field, plus the number of parameters to be passed, is
added to the stack pointer, a new RCW is built, and written into
memory at the new stack pointer. Upon subroutine return, the
stack pointer and program counter are loaded from the RCW. The
parameters that are passed include the base addresses of any
shared named common areas, and pointers to any variables or arrays
that are passed by name. (In FORTRAN, all explicit parameters are
passed by name. However, there is some implicit passing of
parameters by value, as when calling a mathematical function.)
The result of managing subroutine working space as a stack is that
recursive subroutine calls are allowed, even though there seems to
be no use for them in aero flow and weather codes.
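The entry and return sequence can be modelled with a small Python sketch. The names RCW, Machine, call and ret are hypothetical, memory is modelled as a dictionary keyed by address, the return address is taken to be the instruction after the call, and the sketch assumes the size field already accounts for the RCW itself; the report does not give these details.

```python
from dataclasses import dataclass


@dataclass
class RCW:
    caller_rcw: int   # a) memory address of the caller's RCW
    return_pc: int    # b) program counter setting for the return
    size: int         # c) size of the memory area this procedure needs


class Machine:
    def __init__(self, main_size):
        self.memory = {0: RCW(caller_rcw=0, return_pc=0, size=main_size)}
        self.sp = 0   # integer register used as the stack pointer
        self.pc = 0

    def call(self, target_pc, n_params, callee_size):
        # The size field plus the parameter count is added to the stack
        # pointer, and a new RCW is written at the new stack pointer.
        new_sp = self.sp + self.memory[self.sp].size + n_params
        self.memory[new_sp] = RCW(self.sp, self.pc + 1, callee_size)
        self.sp, self.pc = new_sp, target_pc

    def ret(self):
        # The stack pointer and program counter are reloaded from the RCW.
        rcw = self.memory.pop(self.sp)
        self.sp, self.pc = rcw.caller_rcw, rcw.return_pc
```

Because each call gets its own RCW and working area above the caller's, a subroutine at address 500 can call itself: two nested calls from a main program of size 10 place RCWs at addresses 12 and 20, and two returns unwind them in reverse order.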
4.2.3.7 Concurrency
In the prototype compiler, the only concurrency allowed will be
that of all the instances of a single DOALL. All instances would
be executing copies of the same code file. Execution sequencing
dependent on which domain element an instance is associated with
could be controlled by testing the instance-variables to determine
which element they represent. Nested DOALLs would have the inner
DOALL implemented as an ordinary DO loop.
The hardware is not constrained to have all processors executing
out of the same code file. Thus, in principle some instances of a
DOALL could have one sequence of code, and other instances could
have some other sequence, but this would not be allowed in the
prototype compiler.
Capabilities for operations in which the processors operate asynchronously
with no synchronization are not provided. Neither are
capabilities provided in which a few processors are allowed to
execute code separately from the other processors which are using
the coordinator for synchronization.
4.2.3.8 Duplexed Computation Mode
A compiler option planned (but not for the prototype compiler) is
to generate the code and controls to execute each sequence of code
twice but with the spare processor switched between executions.
In this mode, all execution occurs in a different set of proces-
sors on the second pass. The results of the two passes would then
be compared as a confidence test for highly-reliable results.
4.3 OPERATING SYSTEM
The NASF should have only one operating system, pieces of which
execute on the various portions of the system. In the discussions
below, this operating system is called the Master Control
Program (MCP). The purpose of the MCP is to provide software
support for the following:
1. Scheduling and controlling the flow of programs and files
to and from various processors in the system (including
the Support Processing System and the FMP),
2. Initiating staging of jobs onto the FMP,
3. Memory management including storage management and data
management,
4. Support of the FMP FORTRAN programs for functions that
cannot be performed in problem mode because of overall
system implications,
5. Support of other functions of the Support Processor-FMP
interface such as performance monitoring, error logging
and operator control,
6. Support of the external environment including interrupt
handling, I/O handling, peripheral control and data
communications,
7. Providing certain system utilities such as dump, and
system log analyzer,
8. Support of diagnostics and maintenance for all parts of
the system.
The development of a system of this magnitude is a major task.
During the study of the feasibility of the NASF, the MCP considered
was based on the existing MCP on Burroughs 700 series and
800 series systems, in particular the B7800. The MCP of this
system has evolved from systems as early as 1960 and is,
therefore, a mature system which would need no modification to
satisfy many of the above requirements. Recently, Burroughs has
been developing the Burroughs Scientific Processor (BSP) as an
attached processor to the B7800. The general philosophies of job
flow and task management in the NASF and BSP are very similar.
The MCP described below is therefore based on some of the design
decisions and experience gained in the BSP project.
4.3.1 Assumptions
The evaluation of the proposed MCP implementation is based in part
on the assumption that the FMP would be designed to operate most
efficiently on tasks with the following characteristics.
1. Data areas up to the size of the extended memory (34
million words).
2. Long running programs: a minimum runtime of at least one
second, a typical runtime of several minutes to several
hours.
3. Batch job oriented: user interaction is not required.
Also, as discussed in Chapter 2, a self-managed file system sup-
ports the basic data management functions. This file system is
assumed to not only provide the necessary data storage and retrie-
val functions, but would also maintain and enforce data ownership
and access control.
4.3.1.1 Computational Envelope
An FMP task, once started, is assumed to run to completion within
the high-performance computational and I/O environment of the FMP
without requiring intervention of or access to the support processor
or any of its I/O devices. The computational envelope is the
high-performance environment. In particular:
1. All FMP program and data files are assumed to be fully
contained within DBM while the program is in operation.
All files holding the necessary input are assumed to be
within the Data Base Memory (DBM) before the task is
started.
2. Each FMP program is self-contained as far as resources
are concerned. No dependencies on Support Processor
actions shall occur during the runtime of the program.
Therefore, no operator or user interaction would be
permitted during execution of an FMP program. Operators
and users would be able to query the MCP regarding the
status of the job running on the FMP and would have
normal controls such as cancel or suspend execution.
4.3.2 B7800 MCP
The existing B7800 MCP actually provides more functions than re-
quired for the Support Processor System of the NASF. Only those
sections which are of major importance to the NASF MCP are summar-
ized below.
4.3.2.1 Interrupt Handling
The B7800 style systems being considered for the Support Processor
are all interrupt-driven. The interrupt handling section inter-
faces with all the resource-handling parts of the MCP. Interrupts
are caused by the B7800 CPU, by the I/O Processor, and by software.
Some of the interrupts processed by the interrupt handler are:
1. Caused by B7800 CPU
a. Interval Timer
b. Presence Bit not set (part of automatic memory
management)
c. Invalid Operand
d. Invalid Index
e. Processor-to-Processor Communications
2. Caused by B7800 I/O Processor
a. Operator Request Pending
b. I/O Complete
c. Data Comm. Processor Ready-To-Send
3. Caused by Software
a. Inter-Task or Intra-Job Communication
4.3.2.2 Memory Management
Memory management methods supported by the B7800 MCP are designed
for implementation of the "virtual memory" concept within the
B7800. Several methods of memory allocation are supported on the
B7800. These methods include [11]:
1. On demand
2. Working set
3. SWAPPER
All methods use disk as the backup storage device.
4.3.2.3 MCP I/O Handling
Since the MCP is involved in all I/O to and from devices attached
to the B7800, the MCP I/O handling functions are re-entrant code
shared by all tasks running in the B7800 system. These I/O pro-
cedures perform the following functions:
1. Build the control words necessary to do a physical I/O
operation
2. Send I/O instructions
3. Wait for an I/O operation to complete
4. Notify the associated program to continue
5. Handle physical I/O errors
a. Retry where possible
b. Enter user error routine if declared, or
c. Discontinue the program
4.3.2.4 Process Control
The job selection process within the B7800 MCP considers the
priority declared by the user, the time the process has been wait-
ing, and the "class" (or system-level priority) of the task. The
process control section supports the following functions in the
B7800.
1. Initiation of tasks required by the user or by the MCP
2. Task scheduling
3. Perform "EOT/EOJ" duties such as deallocation and
bookkeeping at End of Task or Job
4. Make administrative log entries
4.3.2.5 Peripheral Control
Peripheral Control procedures of the MCP are responsible for all
peripheral devices on the B7800, except disk. These procedures
perform the following functions:
1. Locate input data files
2. Assign output devices based on availability
3. Maintain and update table of all available units
4. Handle I/O parity recovery such as tape parity and card
reader errors
5. Maintain system-level status such as ready, repair, ...
for all physical units including processors, memories and
peripheral devices.
4.3.2.6 Work Flow Management
The processing of the tasks within a user's job is specified
through use of an easy-to-use, high-level work flow control
language called WFL [12]. The work flow management software on
the B7800 consists of a controller (which handles most keyboard
input messages and places control records into a Job Description
File), a WFLCOMPILER (which generates object code for presentation
to the Process Control Section based on jobs in the Job
Description File) and a job formatter (which selectively prints
summary information about the job on the Job Summary sheets).
Most operator keyboard messages are handled through the controller
portion mentioned above.
4.3.2.7 Data Communications
The data communications section of the B7800 MCP is called the
Data Communications Controller (DCC). The functions of the DCC
include:
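1. Allocation and deallocation of Data Communications Queues
which are the interface mechanism between object pro-
grams, system routines such as the editors, and the DCC.
2. Handling of the special constructs used to change
terminal characteristics and the system's responses
to the terminals.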
3. Dynamic reconfiguration of the Data Communications
Subsystem
4. Generation and Maintenance of tables used by the Data
Communications Processors
A system called NDL (Network Definition Language) [14] provides a
user-oriented means of specifying network and terminal
characteristics as well as what processing must be performed
during I/O to match the terminal or network characteristics to the
standard forms processed in the system.
4.3.3 Integration of FMP Task Management into MCP
FMP programs would exist as tasks within the standard WFL (Work
Flow Language) job structure of the B7800. The B7800 portion of
the MCP schedules the FMP task to be staged into the FMP. Once
such a task is initiated, it would run wholly within the
computational envelope without any further B7800 dependence until
the task terminates. The B7800 portion of the MCP may,
optionally, query the status of FMP tasks, or override the FMP
task-selection decisions.
4.3.3.1 Limitations
Some functions traditionally associated with operating systems are
not provided on the FMP even though they are a normal part of the
B7800 itself. Specifically:
1. FMP FORTRAN is the only language provided.
2. Interactive programs are not supported.
3. No provision, other than direct I/O, will be made for
programs whose total file sizes exceed memory capacity.
4. Delays due to waiting for operator intervention on behalf
of executing FMP programs would be eliminated.
The data base sizes expected are very large. If a job mix with a
large number of very short jobs with large data bases is encoun-
tered, the file system and paths to and from the DBM would become
a bottleneck. If this occurs, efficient utilization of the FMP
would become difficult.
4.3.3.2 Interrupt Handling
The interrupt handling section of the existing MCP would be modi-
fied to include those interrupts caused by the FMP. The major
interrupts from the FMP would be "Task Complete" and "Error State
Pending". Task Complete would be a normal FMP task completion
report. This response would be passed on to the Work Flow Manage-
ment section to determine what task to process next. The Error
State Pending would be the report of an abnormal termination.
Status information would have to be scanned out of the FMP to
determine whether the problem is user-related (such as overflow)
or hardware-related (such as a failure in that portion of the
system which is not automatically corrected).
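The handling of the two major FMP interrupts can be sketched as a small dispatch routine. This is an illustrative sketch only: the names `FmpInterrupt`, `handle_fmp_interrupt`, and `scan_status`, the returned strings, and the set of user-related status codes are all hypothetical (the report names only overflow as a user-related example).

```python
from enum import Enum


class FmpInterrupt(Enum):
    TASK_COMPLETE = "Task Complete"
    ERROR_STATE_PENDING = "Error State Pending"


# Assumed status codes; the report gives overflow as its only example.
USER_RELATED = {"overflow"}


def handle_fmp_interrupt(interrupt, scan_status=None):
    """Return the action taken by the MCP for one FMP interrupt.

    scan_status: hypothetical callable that scans status out of the FMP.
    """
    if interrupt is FmpInterrupt.TASK_COMPLETE:
        # Normal completion is passed on to Work Flow Management.
        return "notify work flow management"
    if interrupt is FmpInterrupt.ERROR_STATE_PENDING:
        status = scan_status()
        if status in USER_RELATED:
            return "report user error: " + status
        return "report hardware error: " + status
    raise ValueError("unknown FMP interrupt")
```

Anything not recognized as user-related is treated as hardware-related here; a real classification would depend on the status information actually available from the FMP.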
4.3.3.3 Memory Management
No change would be made to the B7800 Memory Management section of
the MCP. However, input data and program staging would have to be
initiated by the B7800 MCP for FMP-destined jobs. Only the re-
quests need to be made. The File System actually performs the
function.
4.3.3.4 Process Control
The process control section of the B7800 MCP would be extended to
support the scheduling and initiation of tasks on the FMP. The
process control section would also maintain FMP log entries and
statistics with respect to workload, job lengths, etc.
4.3.3.5 Work Flow Management
Extensions in the B7800 WFL (Work Flow Language) would provide the
following functions:
1. Invoke the FMP FORTRAN compiler and linker.
2. Specify FMP resource requirements for scheduling and
allocation purposes (such as the amount of DBM buffer
area required during FMP task execution).
3. Specify job restart point following failure of any
portion of the system.
In addition, the existing work flow management functions which
support operator control of jobs and tasks in the system would be
extended to include tasks running on the FMP. These extensions
would include static controls to give visibility of the status of
a task either queued or active on the FMP. In addition, the exten-
sions would provide means for an operator to alter the priorities
of tasks queued for service and even to force a roll-out of an
active task (for later resumption). Such a roll-out would
normally be only to the Data Base Memory.
4.3.3.6 Utilities
Various utilities specifically oriented to the support of FMP
operations would be developed. These utilities would include
various "analyzer" utilities to edit and format dumps.
4.3.4 FMP Portion of MCP
A portion of the NASF MCP would be resident in the FMP. In partic-
ular, the coordinator is the part of the FMP which would execute
the FMP portion of the MCP. The functions provided would include:
1. Interface to the Support Processor for FMP initialization,
operator control, task forwarding, checkpoint/restart,
dumps, etc.
2. Schedule and initiate tasks on the FMP from among those
forwarded from the Support Processor. Provide wrap-up
for normal and abnormal termination.
3. Establish connection between an active program executing
on the FMP and the appropriate files in the Data Base
Memory.
4. Service FMP interrupts such as invalid operand or errors.
5. Provide the appropriate run-time environment for FMP
FORTRAN execution. This environment would include the
appropriate intrinsics plus mechanizations of time, date,
PAUSE and dump. The run-time environment would also
support code overlay mechanisms, space allocation, and job
roll-in and roll-out.
4.3.5 File Management
An independent file manager provides transparent management of all
files on archive, disk, and in the Data Base Memory (DBM). This
file manager is accessible from the FMP, the Support Processor,
and the Users. Thus, the file management system will have capabil-
ities exceeding those required only to support FMP execution.
One of the functions of the file management system will be to
accept commands designating movement of or copying of files from
one place to another. These commands would be utilized to init-
iate the movement of programs and input data to the Data Base
Memory as needed for FMP execution.
The Data Base Memory, and its controller, are considered part of
the File System portion of the NASF although the sole purpose of
the DBM is as a staging memory of FMP jobs and data. Since the
DBM is part of the File System and since the File System provides
data and storage allocation capabilities, the portion of the MCP
on the FMP does not require any file management capabilities.
Another of the functions of the DBM will be to allow certain
functions to be externally enabled. The best example of this
capability would be a request to the File System by the Work Flow
Management portion of the MCP (executing on the Support Processor)
to cause the movement of result files of a particular FMP job back
to the active files from the DBM. This request could be made
contingent on a message from the FMP portion of the MCP to the DBM
controller that the result files are closed and can be released.
Other functions to be provided by the file management system will
include:
1. Dynamic allocation and deallocation of space as required.
2. Establishment and maintenance of directories or other
techniques to map external requests (which will be in
terms of the "name" of a file) to the appropriate physical
storage area.
3. Backup and archiving of files based on specified condi-
tions or time intervals.
4. File Security functions which would allow user control
over which programs and/or users would be allowed to read
and/or update their files.
4.3.5.1 FMP Interaction with File Subsystem
Since the file system is self-managed, all references to data
within the file system would be by name of the data rather than by
direct reference to its physical position. FMP interaction with
the file system occurs at two levels of the system. First, the
coordinator provides the high-level interface to the file system,
in particular to the Data Base Memory Controller. Second, the
Data Base Memory is part of the File System, and as such has an
operational interface to the File System Manager and the rest of
the file system.
The operational interface between the DBM and the rest of the File
System provides the required data paths as well as control paths
to support:
1. movement of files within the file system
2. storage allocation
3. security functions
The interface between the coordinator and the DBM has basically
the same functions as interfaces between the file system and other
NASF subsystems such as the Support Processor and the Users.
Allocation of space within the Data Base Memory is controlled by
the File System, not by application programs. The DBM maintains a
table to convert file names into DBM addresses. Thus, the files
referenced by the coordinator are referenced by name rather than
by physical location.
Control of the files within the DBM follows the philosophy of the
rest of the file system. Once a particular file has been opened
by an external request, that file is frozen as far as allocation
is concerned and remains resident (for example in the DBM where
coordinator requests are concerned) until closed. The coordinator
would have the capability of initiating a transfer from DBM to EM
very similar to a DMA (Direct Memory Access). Such a transfer
identifies the name of the file in the DBM and the length and
physical location of the EM area reserved for the transfer.
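The name-based access described above can be illustrated with a toy model. This is a sketch under stated assumptions rather than the DBM controller design: the class `DataBaseMemory`, its method names, and the word-addressed layout are hypothetical. It shows three points from the text: files are looked up by name, an open file is frozen in place, and a transfer request carries a file name plus the length and location of the EM area.

```python
class DataBaseMemory:
    """Toy model of DBM name-based access; not the actual controller design."""

    def __init__(self):
        self._table = {}      # file name -> (DBM address, length in words)
        self._open = set()    # names of files frozen in place while open

    def load(self, name, address, length):
        """Record a file staged into the DBM by the file system."""
        self._table[name] = (address, length)

    def open(self, name):
        if name not in self._table:
            raise KeyError(name + " is not resident in the DBM")
        self._open.add(name)

    def close(self, name):
        self._open.discard(name)

    def relocate(self, name, new_address):
        """Move a file; refused while the file is open (allocation frozen)."""
        if name in self._open:
            raise RuntimeError(name + " is open and cannot be moved")
        _, length = self._table[name]
        self._table[name] = (new_address, length)

    def transfer_to_em(self, name, em_address, length):
        """Describe a DMA-like transfer; the caller never sees a DBM address
        until the table lookup is done on its behalf."""
        if name not in self._open:
            raise RuntimeError(name + " must be opened first")
        dbm_address, file_length = self._table[name]
        if length > file_length:
            raise ValueError("requested length exceeds file length")
        return {"from_dbm": dbm_address, "to_em": em_address, "words": length}
```

Because the requester supplies only a name, the file system remains free to place or move closed files without any change to the coordinator's requests.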
Operation over this interface can be summarized as follows. When
an FMP task has been requested (in the Support Processor), the Sup-
port Processor passes the names of the files needed to start the
task to the file system. In addition, the FMP portion of the MCP
is notified of the expected arrivals and an entry would be made on
a queue of "pending" job requests. In the meantime, the file
system would be busy transferring the requested files to the DBM.
When the job currently executing on the FMP completes and its
files are closed, the file system begins transferring those files
back to the bulk storage regions. At this time, the coordin-
ator, under control of the pending task list, takes those steps
needed to initiate execution of the next task for which all re-
quired files are resident in the DBM. This task scheduling re-
quires that the status of the file loading into the DBM be avail-
able to the coordinator. To begin the startup of a job, the co-
ordinator would then open the program code file and request that
it be transferred to some specific area of the EM. Other files
used for standard system monitoring would be opened at the same
time. The FMP task would begin execution after the coordinator
completed broadcast of the code files to the processors.
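The selection rule used by the coordinator (start the next pending task for which all required files are resident in the DBM) can be sketched as below. The function name `next_ready_task` and the data shapes are hypothetical, and scanning the pending list in queue order is an assumption; the report does not say how ties or priorities are resolved.

```python
def next_ready_task(pending, resident_files):
    """Return the first pending task whose required files are all in the DBM.

    pending: list of (task_name, required_file_names) in queue order.
    resident_files: set of file names the file system reports as loaded.
    Returns None when no pending task is ready yet.
    """
    for task_name, required in pending:
        if set(required) <= resident_files:
            return task_name
    return None
```

For example, with pending tasks `[("A", ["a.code", "a.data"]), ("B", ["b.code"])]` and only `b.code` resident, the sketch selects task B, which is why the coordinator needs visibility of the file loading status.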
Not all files would wait until the end of an FMP run to be unloaded
from the FMP. The Support Processor would be able to specify the
destination of expected DBM output files prior to completion of
FMP task execution. The file system would then provide automatic
staging out of the DBM once the file of interest is closed. More
discussion related to this area can be found in Section 5.9 (DBM
Controller).
4.3.6 Job Structure
A job is the only unit of work in the NASF. The job is itself a
very simple program which invokes and determines the relative
sequence of a set of programs. These programs constitute a set of
logically related tasks which perform some data transformation on
files. A job is written in FMP Work Flow Language (WFL) and it
runs on the Support Processor (B7800). FMP WFL contains B7800
standard WFL as a proper subset, so any existing B7800 (or B7700)
job can run unmodified on the NASF. The WFL commands are either
simple action commands (RUN, COMPILE) or tests of conditions (IF
SUCCESSFUL-COMPILE THEN...).
4.3.6.1 Organization of a Job
The basic outline of a typical job is constrained by the computa-
tional envelope and LINKER concepts (see Section 4.3.7). The
typical job will contain, in sequence:
1. None, one or more FMP FORTRAN compilations
2. If there is a compilation, a LINKER task
3. Specification of necessary input files for FMP program
4. One or more executions of FMP programs
In addition, any number of B7800 tasks may be interspersed with
the above, such as to generate input files, or to process output
files.
4.3.6.2 Flow of Job
Figure 4.13 shows a general view of the flow of a job in the NASF.
A job enters at the upper left (BOJ=Beginning of Job). First the
job itself, the Work Flow Language, must be analyzed so the job is
scheduled and finally analyzed on the CPU. The result is a
JOBFILE which controls the sequencing of the rest of the tasks in
the job. If FORTRAN compilations and LINKER tasks are requested,
control remains on the left of the figure. When an FMP task is
specified, that request together with the identification of any
files needed is passed to the File System (upper right of figure).
Once all the files have been staged into the DBM, the task is plac-
ed READY for FMP execution (lower right of figure). Once the FMP
task is complete, the Support Processor is notified so that the
next task specified in the Work Flow can be started. When all
tasks are complete, the job terminates (EOJ = End of Job, at lower
left of figure).
[Diagram labels: BOJ; NEEDS CORE QUEUE; GET SPACE; READY QUEUE;
CPU PROCESSING; END SPS TASK; NEEDS DBM SPACE; START FMP TASK;
END FMP TASK; COORD. INT; EOJ]
Figure 4.13 NASF Job Flow Diagram
4.3.7 Program Load and Overlay Support
The FMP evaluated would run only one program at a time. No addi-
tional program or data area may be preloaded into the EM or pro-
cessor memories. Although preloading might minimize setup delays
when starting the next task, additional hardware would be required
to support the desired level of security. The Data Base Memory
and its controller allow preloading of programs and data. Secu-
rity can be better maintained at this level since all references
to data in the DBM are by descriptor (or name).
The LINKER accepts object code files from one or more separate
FORTRAN compilations and produces a single load code file, called
the loadfile. In the process, the LINKER assigns memory locations
to all program instructions and resolves or relocates address
references accordingly.
For the case that the program memory part of the user program is
too large (i.e. would not fit within the processor memory), the
LINKER supports an overlay facility. With this mechanism, the
user may divide a program into multiple phases and then may
specify which phases share the same memory locations.
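The saving from overlays can be estimated with simple arithmetic: phases that share the same memory locations need only as much space as the largest of them. The sketch below assumes a simple model (one always-resident part plus independent overlay regions); the function name and the sizes used in the example are hypothetical.

```python
def program_memory_needed(resident_words, overlay_regions):
    """Words of program memory needed when phases overlay one another.

    resident_words: size of code that is always in memory.
    overlay_regions: list of regions; each region is a list of phase sizes
        for phases that share the same memory locations.
    """
    return resident_words + sum(max(region) for region in overlay_regions if region)
```

With 1000 resident words and three phases of 3000, 2500, and 4000 words sharing one region, the program needs 5000 words instead of the 10500 words required without overlays.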
For the case that the data part of the user program is too large,
the user may use the direct I/O facilities to and from files in
the DBM. Automatic virtual memory mechanisms were not suggested
for this system since the applications considered during the study
did not require such mechanisms. If a significantly different
workload and application for the system is expected (than the
applications studied), the cost-benefit tradeoffs should be re-
evaluated.
Data is either initialized, uninitialized, or initialized to
"invalid". Initialized segments have their initial contents
present in DBM as generated by the Compiler/Linker. Uninitialized
segments and segments to be initialized to "invalid" are not
present on the DBM. In this case, storage is initialized by the
execution of appropriate FMP code.
4.3.8 Operations Support
4.3.8.1 Performance Monitoring
Certain information will be monitored during NASF operations,
collected, and reported as part of the system log. Some of this
information is accumulated by the B7800 as part of normal monitor-
ing in the existing MCP. Other information would be collected by
the FMP portion of the MCP. Some of the information that may be
included in such monitoring is:
1. Interval timer reading at the time of the report
2. Real time clock at time of the report
3. Count of CN-using instructions
4. Some measure (to be determined) of the processor idle
time
5. A measure of the time that the coordinator only is busy
(i.e. all processors idle)
6. Count of successful error corrections
7. For each error correction, the address and the observed
pattern
8. Time spent in specific subroutines
9. Others to be determined
The interval timer in the coordinator would be coordinated with
the Support Processor at the beginning of a run.
Other monitoring in the FMP would be task related. Beginning-of-
Task and End-of-Task of FMP tasks, OPEN and CLOSE of DBM files,
and traffic to and from the DBM would be logged. Operator console
system status display would be extended to include FMP tasks.
4.3.8.2 System Initialization
FMP initialization is that process whereby the FMP is transformed
from an indefinite (i.e. any arbitrary) state into a state in
which it normally processes user programs. This process reinitial-
izes all parts of the system. Conceptually, the initialization
process corresponds to a coldstart where not only is the MCP
loaded, but all tables, directories, etc. are initialized.
Initially, no process corresponding to a "coolstart" (where the
disk directory is saved) or to a "halt load" (where jobs are
restarted from the last inactive point) will be implemented. Re-
start in the face of failures needs to be carefully studied since
there seem to be a number of natural points at which execution
could resume after a failure without having to reinitialize. In
particular, while executing all the instances of a DOALL, if one
processor failed, only those instances assigned to that processor
would have to be recomputed in the spare processor. Since the
ENDDO would have occurred without successful completion of all
instances, the old values from the start of the DOALL would still
be available. Careful analysis of this sort of a circumstance may
show other "natural" retry points in the system.
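The DOALL retry idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not a proposed recovery mechanism: the function `run_doall`, the assignment table, and the processor numbering are hypothetical. It relies on the property stated above that the old values from the start of the DOALL remain available, so only the instances assigned to the failed processor are recomputed on the spare.

```python
def run_doall(old_values, body, assignment, failed_processor=None, spare=None):
    """Simulate one DOALL with recovery of a single failed processor.

    old_values: list of inputs, left untouched until every instance succeeds.
    body: function applied independently to each instance.
    assignment: dict mapping processor id -> list of instance indices.
    failed_processor: id of a processor whose results are lost, if any.
    spare: id used for the spare processor that recomputes lost instances.
    Returns (new_values, recomputed), where recomputed maps each redone
    instance index to the spare processor id.
    """
    new_values = [None] * len(old_values)
    lost = []
    for proc, instances in assignment.items():
        if proc == failed_processor:
            lost.extend(instances)      # these results never arrive
            continue
        for i in instances:
            new_values[i] = body(old_values[i])
    recomputed = {}
    for i in lost:                      # only the lost instances are redone
        new_values[i] = body(old_values[i])
        recomputed[i] = spare
    return new_values, recomputed
```

The sketch works only because the instances of a DOALL are independent and read the old values; a loop with dependences between instances would not have this natural retry point.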
Initialization of the FMP itself consists of the following steps:
1. The driver program (executing on the Support Processor)
determines that the B7800 - FMP connection is operation-
al. This connection is a low bandwidth connection via the
Diagnostic Controller (DC) part of the FMP and the Data
Comm Controller on the B7800.
2. The driver transfers the FMP portion of the MCP to the
coordinator via the DC. The coordinator then begins
execution of its part of the MCP.
3. An initialization phase of the FMP MCP will perform
various initialization functions, including confidence
tests.
4. The MCP will then complete its initialization and inform
the Support Processor.
The FMP is then ready to process programs.
4.4 OTHER SOFTWARE REQUIREMENTS
Although the FMP FORTRAN language and compiler, and the NASF
Master Control Program (MCP) are the key elements of the NASF, a
number of other software capabilities and requirements exist.
These capabilities and requirements might be classified as those
which are supportive to the language and MCP developments and as
those which may provide more general utility of the system.
To support the language, software development cannot stop with the
compiler (both a prototype version and a more final version). In
addition, a system development language must be identified to
support the development of the operational environment. Input-
Output Formatting routines would need to be developed, especially
if a final review of the impact of various system scenarios shows
that the Support Processor would then be the appropriate system
resource to provide all I/O support. The program library and
overlay facilities that may be desirable would be supported by a
LINKER or BINDER.
Those jobs in execution on the NASF will need to be able to util-
ize various intrinsics, some of which will be resident on the FMP.
These intrinsics would include FMP task initialization (including
EM and PM loading), run-time execution monitoring, and mathe-
matical intrinsics.
Some of the simulation support that would be needed in the develop-
ment of the NASF could be based on work done as part of this
study. Simulators at various levels would be utilized, including:
• NASF block-level simulation
• FMP simulation for timing estimates
• Functional simulators for early code development support
Another important area of software would be the systems developed
to support the diagnostics and maintenance of the NASF (which are
discussed in more detail in Chapter 6). These software tools
would include:
• Off-line FMP diagnostics which would be initiated by the
Support Processor and exercise the FMP when no jobs were
active.
• On-line processor diagnostics to be used both as part of the
off-line FMP diagnostics above and as a means of testing the
spare processors when not actively assigned to user problems.
• Automatically managed FMP confidence tests
• Diagnostic generation tools to be available both during
development and initial test of the system, and also as a tool
to allow the Field Engineer to produce new tests as required.
• All standard diagnostics and maintenance tools provided as
part of any standard equipment included in the NASF.
• Tester Software
In addition to the above capabilities, most of which must be
developed specifically for the NASF, software already exists for
that portion of the system which may be implemented with standard
products. For the B7800 Support Processor, a complete set of
languages, utilities, and application packages exist including:
• ALGOL
• PL/I
• FORTRAN
• COBOL
• BINDER (linker)
• CANDE (a text editor)
• WORK FLOW MANAGEMENT (operating system)
• NETWORK DEFINITION LANGUAGE (for communications control)
4.5 CONCLUSIONS
The implementation of a system such as the NASF is a major under-
taking. However, the software portion of the system studied is a
realistic task to approach since it can be based in large part on
existing software. The major part of the operating system exists,
including the techniques to control an "attached processor" with a
computational envelope supporting one user at a time.
The language extensions would be straightforward to implement.
Since the extensions are strongly biased to description of the
problems rather than explicit mapping to the hardware and since
the architecture reflects the structure of the problems, the nec-
essary flexibility exists to allow growth and improved efficiency
over the future of the NASF.
CHAPTER 5
FLOW MODEL PROCESSOR (FMP) HARDWARE
5.1 INTRODUCTION
This chapter contains the results of the past year's study with
respect to the design of the Flow Model Processor (FMP) hardware.
In significant areas, the FMP design presented here is substan-
tially more flexible and more general purpose than the FMP design
of Ref. 1. Whereas that FMP was tailored to be efficient on
programs that could be vectorized, with some extension to the case
where the data did not form vectors, the current FMP performs
essentially just as efficiently whether the data can be arranged
in the form of vectors or not. In the present FMP, the 512 pro-
cessors can work together efficiently as a vector machine; they
can be just as efficient when working as 512 independent scalar
processors.
The FMP is capable of execution in a manner similar to lock-step
array machines such as ILLIAC IV or the Burroughs Scientific
Processor (BSP). Simple programs (a copy resident in each
processor), with no data-dependent branching, will produce this
result. The FMP is not limited to this mode of execution however.
It is also capable of performing in the manner of conventional
multiprocessors. Interprocessor synchronization is implemented
via special commands and use of the shared memory (Extended
Memory).
It is expected that the multiprocessor capabilities of the FMP
would be used on array-oriented problems. In particular, all
processors are cooperating on the same job, with each processor
independently executing some small portion of the job. In this
mode of execution it becomes important to have as small a time
penalty as possible when synchronization of the processors is
required. The coordinator gives the FMP the ability to do
array-wide synchronizations in one instruction.
The result is an architecture that is much more flexible than the
current generation of high-performance processors, in that there
is no requirement to vectorize the algorithm. It is also easy to
put a great many processors to work on a single algorithm because
of the degree of interprocessor cooperation available through the
coordinator and the common Extended Memory. Although the aero flow
codes are dominated by vectorizable algorithms, there are por-
tions, such as subroutine CHARAC in the explicit aero code, where
the data dependency is different from processor to processor, and
the independent execution of each processor simplifies matters
greatly. The radiation and physics computations of the weather
codes use the independence of the processors to an even greater
extent.
Some of the more important design considerations are discussed in
the following subsection. The sections following in this chapter
review the FMP architecture, briefly list the system parameters,
and describe each of the major elements of the FMP in turn.
5.1.1 Design Constraints and Considerations
During the course of a major hardware development project, such as
the FMP, consideration of and compromise between many (sometimes
conflicting) requirements must be made. Some of the important
considerations on this project (throughput, economy,
hardware/software compatibility, and schedule) are discussed
briefly below.
5.1.1.1 Throughput
One major compromise in the design of any processor is between
processor performance and its cost. In this project, the point of
maximum performance per unit of cost is identified on the cost vs.
performance curve for a single processor. Enough of those
processors are built to deliver the required throughput. This
approach contributes to maximizing performance vs. cost for the FMP
as a whole. The above evaluations result in the choice of
high-speed ECL and implementation on large boards.
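The sizing approach described above (find the single-processor design with the best performance per unit of cost, then build enough of them) can be sketched as follows. The function name, the design points, and the assumption that throughput adds linearly across processors are all hypothetical; the report gives no numbers at this point.

```python
import math


def size_array(design_points, required_throughput):
    """Pick the best performance-per-cost design and count processors needed.

    design_points: list of (name, performance, cost) for one processor;
        units are arbitrary and the values are hypothetical.
    Assumes throughput scales linearly with the number of processors.
    Returns (name, processor_count).
    """
    name, perf, _cost = max(design_points, key=lambda p: p[1] / p[2])
    return name, math.ceil(required_throughput / perf)
```

For three made-up design points with performance/cost of 1.0/1.0, 2.0/1.5, and 3.0/4.0, the middle design has the best ratio, and a required throughput of 1000 units would call for 500 such processors.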
5.1.1.2 Economy
Although those sections of programs which are vectorizable can be
conceptually implemented on a processor that enforces lock-step
cooperation among all the processors, the hardware required to
enforce such lock-step operation is almost missing from the FMP.
Each processor is self-contained, with as rudimentary connection
to the rest of the machine as the problem requirements will allow.
The MIMD* construction of the machine also simplifies the soft-
ware, both in terms of system software as well as for application
program development.
5.1.1.3 Hardware/Software Compatibility
The overall economy of a system is directly affected by the
hardware support of software requirements. In some cases specific
hardware features may be required to reduce software costs. On
the other hand, when hardware features are not required, system
costs could be reduced by not providing these features. Some
specific considerations on this project include:
(1) The FMP has only one user program resident on it at any
one time.
(2) Data addresses are independent of code locations. Some
degree of dynamic run-time data allocation is done. For
*Multiple Instruction Stream, Multiple Data Stream
example, space local to a subprogram is allocated upon entry
to that subprogram, and released upon exit, using a stack
mechanism for allocating space. Space is allocated to a
named common only upon entry to the first program unit
naming that common, and is deallocated upon exit from the
last. Integer registers are used as the stack pointer, and as
pointers to named common areas. Many machines of the older
generation allocate space permanently, even during those
periods that the FORTRAN 77 specification declares them to
be undefined. In the present case, that would reduce the
size of the problem that can be handled. For example, in
the implicit aero flow code BTRID is a large named common in
subroutine BTRI, and subroutine SMOOTH has arrays SS and CT.
These do not exist concurrently, so processor memory can be
devoted to BTRID during the execution of BTRI and to SS and
CT during SMOOTH. If space had to be allocated for both of
these all the time, the largest allowable BTRID would be
substantially smaller.
(3) Automatic stack pushing and popping on subroutine entry and exit.
(4) A full set of interrupts both at the processor level and
the coordinator level.
(5) Requests to the Data Base Memory controller, for data in
Data Base Memory, carry the name of the file involved, not
its address.
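The benefit of the dynamic allocation described in item (2) can be shown with a short calculation. The sketch below compares permanent allocation with allocate-on-entry, release-on-exit allocation for data areas that are never live at the same time. The function name and the sizes are hypothetical (the report gives no sizes for BTRID, SS, or CT), and subroutine nesting is not modeled.

```python
def peak_memory(call_sequence, sizes, dynamic=True):
    """Peak words of data memory over a sequence of subroutine executions.

    call_sequence: list of lists; each inner list names the data areas live
        during one subroutine (no nesting is modeled here).
    sizes: dict mapping data area name -> size in words (hypothetical values).
    dynamic: True for allocate-on-entry/release-on-exit, False for permanent
        allocation of every area.
    """
    if not dynamic:
        return sum(sizes.values())
    return max(sum(sizes[a] for a in live) for live in call_sequence)
```

With assumed sizes of 6000 words for BTRID, 2000 for SS, and 1500 for CT, dynamic allocation peaks at 6000 words while permanent allocation needs 9500, so for a fixed processor memory the largest allowable BTRID is correspondingly larger.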
5.1.1.4 Schedule
Historically, every two years' worth of technological development
has resulted in the delivery of computers that are about three
times more powerful for the same cost. Thus, adding an unneces-
sary year between the design freeze and the delivery of a computer
amounts to using technology that is one additional year toward
obsolescence, and has a penalty of a factor of 3^(1/2) (about 1.7)
in computational horsepower. This trend has slowed recently. Even
so, it is important to use straightforward, low-risk designs to
achieve timely delivery.
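The cost of a schedule slip under the stated trend is a one-line calculation. Assuming the trend is exponential (a factor of three every two years), a delay of d years forgoes a factor of 3^(d/2); for one year this is the square root of 3, roughly 1.73. The function name and parameter names below are hypothetical.

```python
def delay_penalty(delay_years, growth=3.0, period_years=2.0):
    """Factor of computing power forgone by delaying delivery.

    Assumes the exponential trend stated in the text: `growth` times more
    power for the same cost every `period_years`.
    """
    return growth ** (delay_years / period_years)
```

Since the text notes that the trend has slowed, a smaller `growth` value can be substituted to see how the penalty shrinks.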
5.2 FMP ARCHITECTURE
Figure 5.1 shows the general organization of the FMP. The major
elements are:
(1) 512 Processors, each containing a scalar execution unit
and storage for data and program,
[Diagram labels: PROC. 0, PROC. 1, ... PROC. 511 (each with CN BUFF,
EU, PM); CONNECTION NETWORK (CN), 2.8 X 10^11 BITS/SEC. (CABLING
BANDWIDTH); EXTENDED MEMORY ... EM520; DATA BASE MEMORY; DBM
CONTROLLER; 2.2 X 10^9 BITS/SEC.; TO FILE MEMORY; COORDINATOR (CR);
DIAGNOSTIC CONTROLLER (DC); TO/FROM SUPPORT PROCESSOR SYSTEM (SPS)]
Figure 5.1 General Organization of FMP
(2) Connection Network used to interconnect processors and
the Extended Memory,
(3) 521 Extended Memory modules, which hold the main data
base of the program,
(4) Data Base Memory, used as a staging area for jobs to be
scheduled and as a high-speed input/output buffer for
jobs in execution,
(5) Coordinator, used to synchronize the processors, to
interface to the Support Processor, and to run
diagnostics, and
(6) Diagnostic Controller, which allows direct control of
fault isolation in the FMP from the Support Processor.
Each processor is self-contained, with integer and floating-point
arithmetic units, its own instruction decoder, its own program
and data memory. Four extra processors are included as on-line
spares to help achieve system availability requirements. In
addition, four extra Extended Memory modules are included as
on-line spares, again to help achieve system availability
requirements.
5.2.1 General Flow Through FMP
During normal operation, all data and program for the next run
will be loaded into data base memory (DBM) prior to the beginning
of the run. The DBM loading is initiated by the scheduler in the
Support Processor via the File System Controller (these NASF
system elements are described in Chapter 7). The scheduler
initiates a run on the FMP through interaction with the
coordinator (CR).
When the run starts, software in the coordinator initiates the
transfer of code files from the DBM to the Extended Memory (EM).
From there the coordinator causes its code files to be loaded in
its memory and causes the Processor code files to be broadcast to
each Processor. The initialization phase of the program (in the
coordinator) then transfers necessary data to EM. These actions
are automatically inserted by the compiler and the linker. With
data in place in extended memory, and allocated space optionally
initialized to "invalid", and with code files in place in
coordinator and processors, user execution starts.
When user execution is in progress, the coordinator serves as a
high-level "instruction sequencer". Processor tasks are
explicitly initiated and when all processors complete their tasks
(by indicating "I got here"), the coordinator causes the next task
to be initiated in its sequence.
5.2.2 Changes from Baseline System
The Baseline System of the preliminary study (see Ref. 1 and Ref.
2) had the same basic organization as the system shown in Figure
5.1. The major difference is in the type of connection between
the processors and the extended memory and in the system
implications of that connection.
The Baseline System proposed use of a "Transposition Network"
which allowed flexible access of vectors and array components from
the Extended Memory. The "price" of this vector-fetching
capability was that the processors had to be synchronized at each
Extended Memory fetch time (accomplished by the Control Unit).
The modification proposed during the present feasibility study
relaxes the need for coordination to only the start and end of
concurrent, independent code sections. To accomplish this, an
alternative scheme for interconnecting the processors and memories,
called the Connection Network (CN), was proposed. The
reduction in synchronization requirements had the side-effect of
greatly simplifying coordination tasks. These simplified tasks are
handled by a unit now called the Coordinator (CR).
Evaluation of system loading has resulted in some proposed changes
in bandwidth between FMP components. The current bandwidth plans
are summarized in Figure 5.1.
5.2.3 Basic System Parameters
No major changes have been made since the preliminary study. The
choice of these parameters was covered in detail in previous
reports (see Ref. 1 and Ref. 2). Following is a summary of the
basic system parameters.
5.2.3.1 Logic Family
ECL is expected to be the preferred logic family. If the final
design were being implemented at this time, Fairchild's 100K
series would be chosen together with compatible memory circuits.
Final selection of a logic family will be deferred to the
appropriate point in the design cycle in order to gain the most
effective, low-risk components.
Chip counts were made assuming chips projected to be available in
1980. Confidence in this count is supported by the count in a
comparable processor which has been designed using circuit types
available in 1978. See Appendix E of Reference 1 for preliminary
data on this processor.
5.2.3.2 Clock Rate
The clock has been assigned a 40-nanosecond period. The instruction
times given in Appendix C are in terms of this clock period.
These times are compatible with the instruction times derived from
the processor design referred to in Appendix E of Ref. 1, using
ECL 100K.
5.2.3.3 Cabling Methods
The same flat belts used successfully in prior projects at
Burroughs for transmitting high-speed signals with fast rise time
and low crosstalk will be used for most of the interunit cables.
Reference 1 discusses this choice.
5.2.3.4 Power
Power and grounding design details are discussed later
in this chapter. The primary design considerations are:
(1) A small number of centralized power conditioning modules
that accept raw power from the mains,
(2) Switching regulators for efficiency,
(3) Defense against faults in the incoming power,
(4) Defense against faults in the FMP,
(5) Noise-reducing grounding methods, and
(6) Non-volatility of DBM contents.
5.2.3.5 Number of Processors
A key decision in the design of the FMP will be the choice of the
number of processors to be implemented. Having designed the most
cost-effective processor, a sufficient number of them are then
linked together to produce the required throughput rate. Having
done this, and found that 512 processors is the nearest round
number to match the aero flow requirements, performance analysis
then confirms that this approach produces an FMP that meets the
aero flow (and weather) requirements. The processor design
selected is one that matches the 80 ns, 16K-bit by one, static RAM
chips that are forecast to be available by the time the FMP is
being designed. This is a fairly simple ECL processor, with 40 ns
clock and 120 ns memory cycle.
A faster processor might allow the FMP to be built with 256
processors. This requires a faster memory, and therefore is
projected to require a smaller (4K-bit), faster (30 ns) memory
chip. The result is a doubling of the number of memory parts
required. The faster processor is also estimated to require far
more logic parts, with a net increase in parts count. More parts
imply more failures, and hence a lowered reliability. Fewer
processors, however, mean a reduced throughput penalty for those
parts of some applications where concurrency cannot be found, and
hence some extension of the spectrum of applications.
Final decisions will be postponed to take maximum advantage of
components available at the time of design. For example, if the
16K-bit chips were faster than here forecast, a faster processor,
but only 256 of them, might be preferred. If 64K-bit chips were
available at the same speed as the 16K-bit chips here forecast,
these would be preferred to the 16K-bit chips, since one would get
twice as much memory with improved reliability due to the reduced
parts count.
In such a case, it is possible that fewer processors would be
needed to obtain the same throughput. When considering the
16-kilobit RAM versus the faster 4-kilobit RAM, the 4-kilobit RAM
chip would require a 4-fold increase in the number of memory
components. In this case, a trade-off between the reliability
impact of a larger number of memory parts and possible reduced
costs from a smaller number of processors seems to indicate that a
more reliable system is the most cost-effective. It takes 512
processors, at 120 ns memory cycle (projected for 80 ns chips) and
40 ns logic clock, to yield the desired throughput of one billion
floating point operations per second.
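As a rough check on these figures, the following sketch works out the sustained rate each processor must deliver and how many 40 ns clocks and 120 ns memory cycles that leaves per floating-point operation. The function name and the assumption that work divides evenly across processors are illustrative only and are not part of the FMP design.

```python
# Illustrative arithmetic only; assumes the work divides evenly across processors.
TARGET_FLOPS = 1.0e9      # desired throughput quoted in the text
CLOCK_NS = 40             # logic clock period
MEMORY_CYCLE_NS = 120     # processor memory cycle


def per_processor_budget(n_processors):
    """Return (flops per processor, clocks per flop, memory cycles per flop)."""
    flops_each = TARGET_FLOPS / n_processors
    ns_per_flop = 1.0e9 / flops_each
    return flops_each, ns_per_flop / CLOCK_NS, ns_per_flop / MEMORY_CYCLE_NS


flops_each, clocks, mem_cycles = per_processor_budget(512)
print(f"{flops_each / 1e6:.2f} MFLOPS per processor, "
      f"{clocks:.1f} clocks and {mem_cycles:.2f} memory cycles per operation")
```

Under these assumptions each of the 512 processors must sustain a little under two million operations per second, which sits below the upper limit of over three million operations per second quoted in Table 5-1.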
5.2.4 Modularity
Although the NASF requirements did not specifically address the
problems of system modularity, the FMP design described below
contains a very small number of standard modules. These modules
are the Extended Memory module, the Connection Network switch
module, and the Processor Module. The Processor Module, in turn,
consists of an Execution Unit Module and a Processor Storage
Module. There is also a Data Base Memory Storage Module.
This modularity allows the potential of configuring smaller (or
larger) systems out of the same parts, with no impact on a user's
perception of the system. In addition, such modularity greatly
reduces the magnitude of the design task for a system of the
required capabilities and should reduce the fabrication costs,
since many copies of a small number of parts will be built.
5.2.5 Preview of FMP Component Descriptions
Following is a brief description of each of the elements of the
FMP together with a formatted tabulation of pertinent features and
a block diagram of each.
For each element of the FMP, a table of characteristics is
given. A very short narrative description gives the intended
function of the element in user programs. Source of control is
identified, and the storage capabilities, both capacity and speed,
are also given. Connectivity to other elements is defined in
detail.
The table also discusses the modes of error control built into the
design. Most of these mechanisms are discussed in more detail in
Reference 1 and Reference 2. The chip count is that projected for
a 1980 design. "TBD" means "to be determined".
5.3 PROCESSOR
The array of 512 processors is charged with the task of executing
the user computations in the program, namely the floating-point
operations on the problem variables.
The processor executes code contained in its own program memory,
and accepts commands from the coordinator. Certain instructions
are executed in synchronism with the coordinator (and hence, by
implication, in synchronism with the entire array, since the
coordinator expects cooperation from all processors.)
The actions of the processor are delineated by the instruction set
detailed in Appendix C. Figure 5.2 shows the division of the
processor into an Execution Unit (EU), a Processor Memory (PM),
and a CN Buffer (CNB). Table 5.1 provides data on the
characteristics of the processor as a whole.
5.3.1 Execution Unit (EU)
Figure 5.3 is a block diagram of the Execution Unit (the logic
part of the processor) and the CN Buffer, showing the independent
integer and floating point units, with separate register files for
each. Figure 5.4 is a diagram of the instruction fetching and
overlap machinery. Table 5.2 provides data on the Execution Unit.
Connections to the processor come from the coordinator and the
Connection Network. The synchronization signals and the 4-bit-wide
command path with its strobe come from the coordinator. The
data paths to and from the connection network are each accompanied
by a strobe. In addition, each processor is connected to
backplane wiring that expresses its own number.
Of the 129 processors in a cabinet, any one may be the spare
processor. Suppose processor No. N is the spare processor. Then
the backplane number for processors 0 through N-1 is correct, but
the backplane number for processors N+1 through 128 must be shifted
down by one, to N through 127, in order that the processors
being used by the program be consecutively numbered. Therefore,
there is a 1-bit signal coming from the spares-designating machinery
which tells the processor whether or not to subtract 1 from
its hard-wired processor number to correct for the location of the
spare. Two bits of the processor number are the cabinet number, and
do not enter into the subtraction.
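The numbering correction described above can be sketched as follows. This is a minimal model, not the hardware design: the function names are hypothetical, and the way the 2-bit cabinet number is combined with the corrected slot number (cabinet times 128 plus slot) is an assumption made for illustration.

```python
# Sketch of the processor-number correction; names and bit layout are assumptions.
SLOTS_PER_CABINET = 129   # 128 working processors plus one spare


def corrected_slot(slot, spare_slot):
    """Return the in-cabinet number a processor should use, or None for the spare."""
    if not 0 <= slot < SLOTS_PER_CABINET:
        raise ValueError("slot out of range")
    if slot == spare_slot:
        return None                     # the spare takes no part in the program
    subtract_one = slot > spare_slot    # the 1-bit signal from the spares-designating machinery
    return slot - 1 if subtract_one else slot


def logical_number(cabinet, slot, spare_slot):
    """Combine the cabinet number with the corrected slot (assumed layout)."""
    s = corrected_slot(slot, spare_slot)
    return None if s is None else cabinet * 128 + s


# With slot 5 designated as the spare, the remaining slots number 0 through 127.
numbers = [corrected_slot(s, 5) for s in range(SLOTS_PER_CABINET) if s != 5]
print(numbers[:8], numbers[-1])
```

The cabinet number is left untouched by the subtraction, as the text requires; only the in-cabinet slot number is shifted.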
[Figure 5.2 FMP Processor Block Diagram. Diagram labels: from
Coordinator (via fanout boards): commands and spare bit,
synchronization, interrupt line; Processor Memory (PM) with fetch,
load address, store data, and fetch data paths; Execution Unit; CN
Buffer with paths to CN and from CN, and spare designator;
processor number (wired in backplane).]
Table 5-1. Processor Characteristics
Number in System: 512 (No. of on-line spares: 4)
Function
To execute code written by the FMP FORTRAN compiler, with an
upper limit on speed of over three million floating point
operations per second. The code is executed cooperatively
with other processors and with the coordinator.
Mode of Operation
Execution of instructions fetched from the processor's own memory;
execution of commands issued by the coordinator (diagnostics
only); interaction with EM via the CN buffer.
Storage Capacities
32,768 words
120 ns cycle (odd-even interlace)
static RAM technology
Connectivities
To/From    Function or Name            No. Signals  Comments
CN         Addresses and data to EM,   24           20 ns per 11-bit frame;
           data from coord. and EM                  1st frame timed with
                                                    120 ns CN clock
CR         Commands plus strobe        5            Synch. with 40 ns clock
CR         Status bits to coord.       4            Change on any 40 ns clock
CR         "go" from coord.            1            40 ns pulse
Backplane  Processor number            10           Wired-in levels
Fanout     Spare bit and spare         2            D.C. level
           designator
Fanout     Clocks                      2            40 ns clock pulse; enable
                                                    for selecting every 3rd
                                                    one for CN clock
Reliability/Repairability/Trustworthiness
SECDED checker on data bus
Numerous error checks leading to error interrupts
Parity on microprogram memory
For operation in the presence of failures, spare processors
can be switched in, or SECDED can be used to cover up
failures in PM or EM.
Physical
Projected chip count:    240
Size:                    1.2" x 11.5" x 27.5" (narrow edge to backplane)
Power:                   325 watts (including 100w losses in the
                         switching regulator)
Additional Constraints:  Includes own self-contained switching
                         regulator
[Figure 5.4 Instruction Fetching and Overlap Diagram. Diagram
labels: staging register; "issue" command; trigger to PM;
scoreboard with start times (integer, floating point, memory) and
end times of the current memory, floating point, and integer
operations; integer unit instruction register; holding register
(for delayed issue); floating point unit instruction register;
memory controls; to decoding.]
Table 5-2. Execution Unit (portion of processor) Characteristics
Number in System: 1 per processor
Function
Executes instructions and coordinator commands, accesses
processor memory, and interfaces with CN buffer.
Mode of Operation
Clocked at 40 ns clock, which is synchronous throughout entire
system.
Storage Capacities
32 words in addressable registers, a few additional
registers also
40 ns cycle
ECL technology
Connectivities
To/From    Function or Name            No. Signals  Comments
PM         Data (bidirectional)        110          Clocked
PM         Address and command         20           Clocked
CN buffer  Data (both directions)      110          Clocked
CN buffer  Address path                34           Clocked; 11 bits EM no.,
                                                    23 bits address
CN buffer  Controls                    5
Fanout     Synchronization & status    5
Fanout     Commands from coord.        5
Backplane  Processor number            10
CN         Sparebit                    1
Fanout     Clocks                      2
Fanout     Spares designator and       2
           sparebit
Reliability/Repairability/Trustworthiness
Contains SECDED checker, microprogram parity, etc., as
mentioned under processor
Failed EU spared out by sparing out entire processor
Physical
Projected chip count:    100
Size:                    About 11" x 10" within processor
Power:                   125 watts
Error control within the processor includes SECDED on data bus
transfers, parity on words in microprogram memory, and the
assortment of error and bounds checks as listed in the description
of the interrupt register.
5.3.2 Processor Memory (PM)
The Processor Memory (PM) contains data and program within each
processor. Control is from the memory address register in the
processor. There are 32,768 words of 55 bits each consisting of
48 bits of data and 7 bits of single-error correcting, double-
error-detecting code. Data, address, and control connections are
solely to the processor. 16k-bit static RAM chips are used.
Table 5.3 describes major characteristics of the PM.
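The word layout above can be checked with a short calculation. The sketch below assumes a standard Hamming single-error-correcting code extended with one overall parity bit (the report does not say which SECDED code is used) and assumes 1-bit-wide 16K RAM chips, as described in section 5.2.3.5; the function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming single-error-correcting code plus one overall parity bit."""
    r = 1
    while 2 ** r < data_bits + r + 1:   # Hamming bound for single-error correction
        r += 1
    return r + 1                        # extra parity bit adds double-error detection


def ram_chips(words, word_bits, chip_bits):
    """Number of 1-bit-wide RAM chips needed to hold `words` words of `word_bits` bits."""
    banks = -(-words // chip_bits)      # ceiling division
    return banks * word_bits


check = secded_check_bits(48)
word_bits = 48 + check
print(check, word_bits, ram_chips(32768, word_bits, 16 * 1024))
```

Under these assumptions the 48-bit data word needs 7 check bits, giving the 55-bit word quoted above, and the 32,768-word memory needs two banks of 55 storage chips, which fits the odd-even interlace noted in the tables.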
5.3.3 Connection Network Buffer (CN Buffer)
The CN Buffer accepts address, data, and commands from the EU, and
in response to those commands, may transmit requests for either
store or fetch to a named EM module, may accept data from the CN
and may transmit data to the CN. The CN Buffer accepts commands
from the EU only. The "strobe" or "acknowledge" received from the
EM module via the CN is used as an indication of the success of EM
requests.
Transmissions of data through the CN are synchronized with the CN
clock, a submultiple of the processor clock. All CN buffers are
synchronized to the same CN clock to eliminate time races in the
CN.
Table 5.4 summarizes the characteristics of the CN Buffer. Figure
5.5 shows the states taken by the CN Buffer controls. The arcs in
the graph of this figure are labelled with the events that cause
a change in state. For explanations of mnemonics, see the
instruction set in Appendix C. All eight states in the top of the
diagram are seen as "busy" by the EU. A four flip-flop internal
state register is assumed. The six command lines from the EU
carry different commands plus "go." Three of the requests
(STOREM, LOADEM, and LOCKEM) result in codes being appended to
addresses sent to the EM. In both cases where "go" is shown as
triggering the change of state, an alternative would be for the
"acknowledge" signal, on the 12th line on the data receiving side
of the CN connection, to serve instead.
The 12 lines going from the CN Buffer out to the CN are 11 data
lines plus a strobe that states the data is valid. The 12 lines
coming from the CN to the CN Buffer are 11 data lines plus
"acknowledge." Each 11-bit piece of data is called a "frame".
Acknowledge is transmitted by an EM module upon successfully
receiving a request through the CN, and stays up as long as the
connection is to be maintained. The CN uses the acknowledge to
latch up the chosen path, so the acknowledge is a logic level that
stays up for the duration of the single operation.
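The frame arithmetic implied here can be sketched in a few lines. The 20 ns frame time is taken from Table 5-1; the function names are hypothetical, and the aggregate figure assumes every one of the 512 processor ports is transmitting on its 11 data lines at the same time.

```python
# Illustrative frame arithmetic; integer math keeps the results exact.
FRAME_BITS = 11      # data lines per direction
FRAME_NS = 20        # time per frame, from Table 5-1
WORD_BITS = 55       # 48 data bits plus 7 SECDED bits
N_PROCESSORS = 512


def frames_for(bits):
    """Frames needed to carry `bits` bits, 11 bits at a time."""
    return -(-bits // FRAME_BITS)       # ceiling division


def transfer_ns(bits):
    """Time on the wire for `bits` bits, excluding connection set-up."""
    return frames_for(bits) * FRAME_NS


def aggregate_bits_per_sec(n_ports):
    """Peak one-way data rate if all ports transmit simultaneously."""
    return n_ports * FRAME_BITS * 1_000_000_000 // FRAME_NS


print(frames_for(WORD_BITS), transfer_ns(WORD_BITS),
      aggregate_bits_per_sec(N_PROCESSORS))
```

On these assumptions a 55-bit word occupies five frames, and the aggregate rate comes to about 2.8 x 10^11 bits/sec, in line with the Connection Network bandwidth marked on Figure 5.1.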
Table 5-3. Processor Memory (PM) (part of processor) Characteristics
Number in System: 1 per processor
Function
To hold program for execution by the EU, and data to be
fetched in response to that program.
Mode of Operation
Program counter (PCR) and memory address register (MAR)
contain addresses for program and data respectively. The
16k-bit chips assumed by the implementation of choice allow
the interlace of odd and even modules.
Storage Capacities
32,768 words
120 ns cycle
NMOS static RAM technology
Connectivities
To/From    Function or Name            No. Signals  Comments
EU         address                     16           Clocked
EU         data                        110          Clocked
EU         command                     5            Clocked
Reliability/Repairability/Trustworthiness
SECDED on all words fetched (SECDED generator/checker is in
the EU)
Detection of illegal instructions, detection of the fetching
of "uninitialized" data, detection of fetching of unnormalized
floating point words.
SECDED allows continued operation at reduced reliability in
the face of single bit failures.
Sparing is done by sparing the entire processor.
Physical
Projected chip count:    130
Size:                    11" x 10" board in processor
Power:                   100w
Table 5-4. CN Buffer (per processor) Characteristics
Number in System: 1 per processor plus 1 in coordinator
Function
To serve as an asynchronous interface with the CN, decoupling
the program running in the PR (or the coordinator) from the
access delays of EM and the CN.
Mode of Operation
Three registers hold EM number plus operation code, EM
address within module, and one word of data. EM number
serves as a request for an EM, when transmitted through the
CN. The address register is loaded by the CR, and sent to
the EM module at the appropriate time. The data word has
bidirectional connections both to CR and CN.
Storage Capacities
1 word
40 ns cycle
Connectivities
To/From    Function or Name            No. Signals  Comments
CN         Data path (bidirectional)   24           20 ns per frame;
                                                    120 ns CN clock for
                                                    initiations
EU         Data (bidirectional)        110          40 ns clock
EU         EM module no. and EM        14           40 ns clock
           command
EU         Address within module       22           40 ns clock
EU         Misc. controls              9
Fanout     "busy"                      1
Reliability/Repairability/Trustworthiness
All data passing through the CN buffer is checked at
destination for proper SECDED code
Sparing is with the processor of which the CN buffer is a
part.
Physical
Projected chip count:    30 chips
Size:                    NA
Power:                   NA
[Figure 5.5 CN Buffer State Diagram. States shown: IDLE; FULL; the
busy states "waiting to transmit address plus data", "waiting to
transmit data", "waiting to transmit address and receive data", and
"waiting to receive data"; and the corresponding transmitting and
receiving states. Arcs are labelled with EU commands (STOREM,
FILLEM, LOADEM, LOCKEM, EMREQ, and FREM, IREM, IOREM or MREM),
ACK from EM, "GO", and TIMED.]
The CN Buffer also contains the capability of remapping the EM
module number of an EM module which has been spared out to a
different EM module number. There are 528 backplane slots for EM
modules in the system, since all four EM cabinets are fabricated
alike. This provides for up to seven spares. However, the
reliability analysis is based on one spare per cabinet, and only
four registers in each CN Buffer are planned for designating
which modules are spare. A 4-word associative memory, recognizing
any one of four 10-bit EM module numbers and substituting spare
EM module numbers for them, is a suggested implementation.
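The suggested associative memory behaves like a small lookup table with at most four entries. The sketch below is a software model of that behavior only; the class name, method names, and the module numbers used in the example are hypothetical and do not come from the report.

```python
class SpareMap:
    """Hypothetical model of the 4-entry associative remap in a CN Buffer."""

    MAX_ENTRIES = 4   # four registers per CN Buffer, per the text

    def __init__(self):
        self._map = {}   # failed EM module number -> spare EM module number

    def spare_out(self, failed_module, spare_module):
        """Record that requests for `failed_module` should go to `spare_module`."""
        if failed_module not in self._map and len(self._map) >= self.MAX_ENTRIES:
            raise RuntimeError("all four remap registers are in use")
        self._map[failed_module] = spare_module

    def translate(self, module):
        """Return the module number actually sent into the CN."""
        return self._map.get(module, module)


remap = SpareMap()
remap.spare_out(17, 521)          # example numbers are made up for illustration
print(remap.translate(17), remap.translate(18))
```

Module numbers that have no entry pass through unchanged, so a program sees the same EM numbering whether or not any module has been spared out.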
5.3.4 Design Rationale and Changes from Preliminary Study
Size of the processor memory was selected on the basis of the
known requirements of the implicit 3-D codes. In the preliminary
study, the requirements were projected to be 16K words of data and
8K words of program. In this feasibility study, we have determined
that it is less expensive to use a single uniform memory with no
penalty in performance. Therefore, the Processor Memory (PM) now
contains both program and data and is sized at 32K words.
As design progresses, it may become clear that 64 kilobit RAM
chips will have adequate speed for this application. If that is
the case and if the price is only twice the price per chip of the
16 kilobit RAM chips currently planned, then the design would be
set up to use 64 kilobit chips. In this case a 64K word PM would
result, giving benefits both in larger storage capacity and higher
reliability (fewer parts). See section 5.2.3.5 for other
discussion.
Another area of change from the Baseline System (1) was the
introduction of the Connection Network Buffer (CN Buffer) just
described. The design objective of the CN Buffer is to provide an
independent logic unit to which the CN-related operations can be
passed while the EU proper continues processing. Waiting for EM
access, or for CN connections, can be done in parallel with other
processing instead of being in series with program execution. The
CN Buffer is included in response to the asynchronous nature of
the CN.
5.4 COORDINATOR (CR)
The coordinator serves two functions. The first is to serve as
the focal point for array-wide synchronizations and array-wide
cooperation. To this end, the coordinator is supplied with an
array-wide synchronization mechanism, namely the "all processors
ready", "go", "any processor enabled", "any processor in interrupt
mode", and so on, as well as an access port to the CN which, in
combination with processor cooperation, allows the passing of a
single piece of data from the coordinator or from one EM module to
all processors, or from all processors, combined into a single
word, to the coordinator.
During diagnostics and initialization, the array-wide cooperation
is imposed on the processors by the coordinator, which has a set
of commands that are designed to read and write every accessible
register within the processor, and generally to exercise any
intraprocessor activity.
The second coordinator function is to run system software, to
interface with the support processor and with the DBM controller
for DBM-EM transfers, and to be exercised by the diagnostic
controller. Note that DBM access requests from the coordinator
are in terms of file identifiers, not addresses.
The host initiates transfers between the file system and DBM using
the DBM allocation map and issuing I/O commands directly to the DBM
controller. No FMP-resident routine is involved in the initiation
or completion of these transfers. The DBM controller resolves
any potential conflict between these host transfers and a
coordinator-initiated DBM-EM transfer.
Figure 5.6 shows the Coordinator's two connections to the CN. One
connection is a CN Buffer identical to the CN Buffer of the
processor, and is used to access EM. The other connection is
logically a memory port, and is used for injecting data to be
broadcast to all processors, or for accepting data that has been
harvested in parallel from all processors.
The Coordinator can be controlled by commands from the host
(Support Processor) computer issued via the Diagnostic Controller.
This interface is used to support the necessary interaction
between the portions of the FMP Operating System resident in the
Support Processor and in the Coordinator. In addition, the Support
Processor can use this interface to initiate maintenance support
procedures.
The speed of the Coordinator is set by the need to execute system
software fast enough not to hold up user programming. That is,
the Coordinator needs to be executing system software for
substantially less time than the processors are processing user
code. Hand-compiled samples show that the Coordinator is almost
completely idle during execution of user code. It will be
recommended that system software be allowed to execute along with
user code, letting "all processors ready" and "processor interrupt"
pull the coordinator back to the user's code as required. It is
also recommended that software conventions allocate certain
coordinator registers for user program use only, and others for
system program use only, thereby eliminating much of the swap
time.
Figure 5.7 shows the block diagram of the Coordinator. Table 5.5
summarizes the characteristics of the Coordinator.
[Figure 5.6 Connections to CN in FMP. Diagram labels: processors
(PROC 1, PROC 2, ...) and the Coordinator on one side of the CN; EM
modules MOD 0, MOD 1, MOD 2, ... MOD 520 on the other; a path for
EM fetches; a path for "BDCST" and "HVST"; a path to DBM.]
[Figure 5.7 Coordinator Block Diagram. Diagram labels: memory
(CRM); communications register; I/O instruction decode (to/from
DC/DBM); instruction buffer; error detection and correction; CN
port to CN (access to processors); bus; CN buffer to CN (access to
EM); arithmetic instruction decode; integer unit logic; integer
registers; memory control (to CRM); to address register and CN
buffer.]
Table 5-5. Coordinator Characteristics
Number in System: 1
Function
Serves as a focal point for the achievement of array-wide
cooperation of processors; serves as the issuing point of
array-wide diagnostics.
Runs most FMP operating system segments, including interaction
with host, logging of error events, hardware reconfigurations.
Mode of Operation
Executes program. Interrupt mechanism allows switching back
and forth between the two modes of operation.
Storage Capacities
32 registers (possibly more)
40 ns cycle
ECL register technology
Connectivities
To/From    Function or Name            No. Signals  Comments
Host       I/O channel                 TBD
DBM        Descriptor issuance,        TBD
           controller status return
EM         Clock EM via EM             2            40 ns clock pulses;
           fanout tree                              120 ns select every
                                                    3rd as CN clock
EM         Error interrupts from EM    2
CN         Control                     24           with CN clock
CN         From CN buffer              24           20 ns per frame; starts
                                                    synch with CN clock
CN         to EM-like port             24           same, but CN clock at
                                                    this port is 60 ns off
                                                    from CN clock at
                                                    CN buffer
Proc. via  Command and strobe          TBD
fanout
Proc. via  Synch                       TBD
fanout
Reliability/Repairability/Trustworthiness
Repertoire of error and reasonableness checks leading
to error interrupt.
SECDED on data bus checks from coordinator memory, from CN
buffer, and from CN to BDCST and HVST. Available for
checking channels to/from host and DBM controller also.
Diagnostic controller has direct access to coordinator state.
Physical
Projected chip count:    2,000
Size:                    20 to 30 large p/c boards
Power:                   Not estimated
5.4.1 Execution Logic
The Coordinator has a number of semi-independent execution
stations, so that more than one instruction may be in the process
of execution at any given time, just as in the processor. The
degree to which overlap, and its additional logic, are worthwhile
is a function of the amount of system software that the
coordinator is required to execute. Using only the two
aerodynamic flow models as benchmarks tells us that no overlap is
required. Therefore the specification of a mechanism of overlap,
as seen in the instruction listings, is only tentative pending
further clarification of the computational load imposed by systems
programming. The units are:
(1) Arithmetic unit,
(2) Memory,
(3) Interface to Support Processor and DBM controller, and
(4) CN buffer.
Instruction timing is given in Appendix D.
5.4.2 Coordinator Memory
The Coordinator Memory holds both program and data for the
Coordinator. It is addressable only from the Coordinator and
sends all data into the central data bus of the Coordinator.
The Coordinator Memory is identical in electrical design and uses
the same 16k-bit RAM chips as the processor memories. The size
resulting from considerations of the flow-model matching study is
32,768 words.
Table 5.6 summarizes the characteristics of this memory. Note
that it is identical to the Processor Memories in all respects.
As with PM, where the processor has a SECDED generator-checker for
all memory words, so here the coordinator has SECDED also.
5.4.3 Design Rationale and Changes from Preliminary Study
The change from the old vector-oriented transposition network of
the preliminary study to the random access connection network of
the design currently described has released the processors from
all requirements on regularity of relationship between the data
processed by one processor and the data processed by any other.
We now truly have 512 separate scalar processors in the FMP.
Hence, all desire to have a separate, different, scalar processor
associated with the Coordinator has disappeared, and the scalar
processor in the control unit of Ref. 2 has not been carried over
into the Coordinator.
Table 5-6. Coordinator Memory (CM) Characteristics
Number in System: 1
Function
To hold program for execution by the coordinator and data to
be fetched in response to that program.
Mode of Operation
Program counter (PCR) and memory address register (MAR)
contain addresses for program and data respectively. The
16k-bit chips assumed by the implementation of choice allow the
interlace of odd and even modules.
Storage Capacities
32,768 words
120 ns cycle
NMOS static RAM technology
Connectivities
To/From      Function or Name          No. Signals  Timing
Coordinator  address                   16           Clocked
Coordinator  data                      110          Clocked
Coordinator  command                   5            Clocked
Reliability/Repairability/Trustworthiness
SECDED on all words fetched (SECDED generator/checker is in
the coordinator)
Detection of illegal instructions, detection of the fetching
of "uninitialized" data, detection of fetching of unnormalized
floating point words.
SECDED allows continued operation at reduced reliability in
the face of single bit failures.
Physical
Projected chip count:    130
Size:                    11" x 10" board in CR
Power:                   100w
5.5 PROCESSOR - COORDINATOR INTERACTION
5.5.1 Instruction Streams
The FMP is controlled by two instruction streams, which are
created in parallel by the compiler from a single sequence of
source statements. One instruction stream is being executed in
the Coordinator; the other is being executed by all processors
asynchronously of each other. Some statements in the source code
result in instructions in both instruction streams. Some of these
joint instructions require that the Coordinator and the processors
synchronize themselves.
5.5.2 Synchronization
The simplest synchronization that may occur is the WAIT
instruction, in which the processor sets "I got here". The
coordinator is, or will be, executing a SYNC instruction. The
SYNC instruction waits until "all processors ready" becomes true.
"All processors ready" is the 512-way AND of each processor's "I
got here" OR NOT "enabled". That is, it is the N-way AND of the "I
got here" lines of the N enabled processors. After seeing "all
processors ready", the
coordinator issues a "go" command, received simultaneously by all
processors, which then reset their "I got here" and execute the
next instruction.
When the processor has raised its "I got here" line, but before it
has received a "go" signal, it is said to be "waiting". The "I
got here" line is dropped upon receipt of the "go" pulse.
A processor is not required to be idle while the "I got here" is
set. Commands are provided to set the flag and to allow
processing to continue. However, each "I got here" is considered a
separate event so if the processor continued execution and wished
to identify another "I got here" event, that command must wait as
required for the flag to be cleared by a "go" command from the
Coordinator.
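The WAIT/SYNC handshake described above can be sketched in a few lines. This is only an illustrative model, not the hardware logic: the function names and the list-of-booleans representation are assumptions made for the example, and the fanout tree timing is not modeled.

```python
def all_processors_ready(got_here, enabled):
    """AND over all processors of ("I got here" OR NOT "enabled")."""
    return all(flag or not en for flag, en in zip(got_here, enabled))


def any_processor_enabled(enabled):
    """OR over all processors of "enabled"."""
    return any(enabled)


def sync_step(got_here, enabled):
    """One evaluation of the coordinator's SYNC instruction.

    Returns (go_issued, new_flags). When "go" is issued, every
    processor resets its "I got here" flag (hypothetical model).
    """
    if all_processors_ready(got_here, enabled):
        return True, [False] * len(got_here)
    return False, list(got_here)


# Four processors, processor 2 disabled: it is ignored by the AND.
enabled = [True, True, False, True]
flags = [True, True, False, True]
go, flags = sync_step(flags, enabled)
```

A disabled processor contributes a constant "true" to the AND, so the result is effectively the N-way AND over the N enabled processors, as the text states.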
5.5.3 Interface
Table 5.7 contains a list of Processor-Coordinator Interface
signals and identifies their use.
In addition to the above synchronization, the CR also has the
power to transmit commands. The commands are carried on a
4-bit-wide bus accompanied by a strobe line. Many of these
commands are used in the diagnostic programs. Some of these
commands are conditional on the "enable" bit of the processor,
some are unconditional independent of the enable bit. No such
command is used in user-generated FORTRAN programs, after initial
program loading.
5-31
Table 5-7. Processor-Coordinator Interface

Processor signal           To or From   Coordinator
                           Processor
"enabled"                  from         "any processor enabled" =
                                        512-way OR of "enabled"
"I got here"               from         "all processors ready" =
                                        512-way AND of ("I got here"
                                        OR NOT "enabled")
"Go"                       to           "Go" signal to CN buffer
"Interrupt coordinator"    from         "processor interrupt" =
                                        512-way OR of "interrupt
                                        coordinator" (a bit in the
                                        coordinator interrupt
                                        register)
"Interrupt mode"           from         "any processor in Interrupt
                                        mode" = 512-way OR of
                                        "interrupt mode" (tested by
                                        PINT instruction)
"sparebit"                 to           Designation of processor
"spare"                    to           number of spare processor
4-bit Command Bus          to           Synchronization and diagnostic
                                        mode command
5-32
5.5.4 Fan-Out Tree (Coordinator-to-Processors)
A series of fan-out boards are supplied to implement the
Coordinator-to-Processor Interface. Signals and clock fan out
from the Coordinator to the final 516-processor destinations.
From the processors, the signals are combined, so that, within the
Coordinator a single result appears in response to 516 signals
emitted by the processors. For example, the "all processors
ready" signal becomes true at the clock that the last enabled
processor emits "I got here". Another such signal is the
516-input OR of "enabled".
At the processor, some signals are wired per-processor directly to
the last level of fanout board; others are daisy-chained to eight
processors from a single signal pin on the last board. The fanout
boards are pin-limited. Simple buffers with one input pin and one
output pin per signal dominate the circuit count, so hex buffers,
easily available today, will not be improved upon by 1979-1980.
Figure 5.8 shows the Fan-out Tree. Table 5.8 summarizes the
characteristics.
5.6 EXTENDED MEMORY MODULE
Extended memory (EM) is the "main" memory of the FMP, in that it
holds the data base for the program during program execution.
Temporary variables, or work space, can be held in either EM or
Processor Memory (PM), as appropriate to the problem. All I/O to
and from the FMP is to and from EM via DBM. Control of the EM
comes from two sources: the first is instructions transmitted over
the CN; the second is the DBM controller, which handles the DBM-EM
transfers.
The Extended Memory consists of 521 on-line modules, and four
spare modules, not used by the working program. Data is allocated
to EM across the modules, with the allocation EM module number =
Address modulo 521 (address is least significant portion) and
address-within-module = address/512.
This addressing mode was chosen as a result of a software
decision. Vectors are an important fetching pattern in the
planned NASF applications (i.e., one vector element to each
processor). It is therefore desirable to design the system so
that vectors of 512 elements will be in 512 separate modules,
reducing memory conflicts and allowing simultaneous access to EM
for all processors. The number 521 is chosen because it is a
prime number larger than the number of processors (512). This
combination then contributes to the above desirable properties.
For a more detailed discussion, see Ref. 1 and Ref. 2.
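The allocation rule can be illustrated with a short sketch. It assumes the rule exactly as stated above (module number = address modulo 521, address-within-module = address/512 with integer division); the function names are hypothetical and chosen only for this example.

```python
NUM_MODULES = 521   # on-line EM modules (a prime number)
ROW_WIDTH = 512     # words held at each address-within-module


def em_location(address):
    """Map a linear EM address to (module number, address-within-module)."""
    return address % NUM_MODULES, address // ROW_WIDTH


def modules_touched(base, stride, count=512):
    """Set of EM modules referenced by a vector of `count` elements."""
    return {em_location(base + i * stride)[0] for i in range(count)}


# A 512-element vector with stride 1 and with stride 512 (e.g. a column
# of a 512-wide array) each land in 512 different modules.
unit_stride = len(modules_touched(base=1000, stride=1))
column_stride = len(modules_touched(base=1000, stride=512))
```

Because 521 is prime, any stride that is not a multiple of 521 spreads 512 consecutive vector elements over 512 distinct modules, which is the property the paragraph above relies on.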
5-33
Figure 5.8 Processor Coordinator Fanout Tree Block Diagram
(Diagram labels: coordinator with cabinet number, 4 copies of 26
signals; first-level (cabinet-level) fanout boards, 4 required;
second-level fanout boards, 32 required; 8 processors daisy-chained
per belt, 512 required.)
5-34
Table 5-8. Fanout (Coord-Processor) Characteristics

Number in System: 1

Function
Provides 512-to-1 connectivity from processors to coordinator.
Provides 1-to-516 connectivity from coordinator to processors.
Provides 1-to-129 connectivity from cabinet number to processors
within cabinet.

Modes of Operation
Passive repetition of signals. No registers or program execution
occurs within the fanout tree.

Storage Capacities
none

Connectivities
To/From   Function or Names               No. Signals   Timing    Comments
Coord.    Synch, status, and command      19            Clocked
          and clock
Proc.     Synch, status command, clock,   14                      per processor
          and cabinet no.

Reliability/Repairability/Trustworthiness
Very low parts count makes additional reliability precautions
unnecessary.

Physical
Projected chip count: 900 (of which 832 are hex buffers of one sort
                      or another)
Size:                 36 boards, 4 cabinet boards, 8 row fanout
                      boards per cabinet
5-35
5.6.1 Basic Characteristics
Each EM module has a storage capacity of 64K words (48 bits data
plus 7 SECDED bits/word).
From each EM module we need a transfer rate and access time consis-
tent with the most economical implementation. An implementation
in 64K-bit dynamic RAM is chosen for availability by 1980. The
low chip count enhances reliability. A 240 ns cycle time of the
memory is projected. Each word carries single-error-correction-
double-error-detection code which is generated at the source
(DBM, CR, or processor) and also checked there, so that transfer
paths are covered by the same error control as the contents of EM.
Figure 5.9 shows the general organization of each EM module. Table
5.9 summarizes the EM characteristics.
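The figure of 7 check bits for a 48-bit data word is consistent with the usual bound for an extended Hamming (SECDED) code. The sketch below only checks that arithmetic; the report does not say which particular SECDED code is used, so the Hamming construction is an assumption, and the function name is hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits needed by an extended Hamming SECDED code.

    r Hamming bits must satisfy 2**r >= data_bits + r + 1; one extra
    overall parity bit adds double-error detection.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


check_bits = secded_check_bits(48)   # check bits for the 48-bit data word
stored_width = 48 + check_bits       # bits stored per EM word
```

For a 48-bit word this gives 7 check bits and a 55-bit stored word, matching the module width quoted in this section.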
5.6.2 Connection Network (CN) Interface from Processors
The commands accepted by the EM module come either from the CN or
from the DBM controller. From the CN, a "strobe" signals the
arrival of a request. The EM module number accompanying the strobe
is matched against the module's own number for error control
purposes. Following the acceptance of the request by the EM, an
"acknowledge" bit is raised by the EM module which locks up the CN
path, and tells the requestor (processor or coordinator) that the
request is being honored.
Following the strobe, and accompanying the address field, will be
any one of four different commands, namely:
(1) STOREM. Data will follow the address; keep up the
acknowledge until the last character of data has
arrived. The timing is fixed; the data item will be
just one word long.
(2) LOADEM. Access memory at the address given, sending the
data back through the CN, meanwhile keeping the
"acknowledge" bit up until the last 11-bit frame has
been sent.
(3) LOCKEM. Same as LOADEM except that following the access
of data, a ONE will be written into the least
significant bit of the word. If bit was ZERO, the
pertinent check bits must also be complemented to keep
the SECDED code correct. The old copy is sent back over
the CN.
(4) FETCHEM. Same as LOADEM except that the "acknowledge"
is dropped as soon as possible. The coordinator has
sent this code to imply that it will switch the CN to
broadcast mode for the accessed data. The data is then
sent into the CN which has been set to broadcast mode by
the coordinator, and will go to all processors.
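The data effect of the four commands can be summarized in a small behavioral sketch. It is a simplification under stated assumptions: the class and enum names are hypothetical, the "acknowledge" timing and broadcast switching are not modeled, and SECDED check bits are omitted, so LOCKEM here only sets the least significant data bit.

```python
from enum import Enum


class Cmd(Enum):
    STOREM = 1
    LOADEM = 2
    LOCKEM = 3
    FETCHEM = 4


class EMModule:
    """Behavioral model of one EM module (data effects only)."""

    def __init__(self, words=65536):
        self.mem = [0] * words

    def execute(self, cmd, address, data=None):
        if cmd is Cmd.STOREM:
            self.mem[address] = data      # one word follows the address
            return None
        old = self.mem[address]           # LOADEM, LOCKEM, FETCHEM all read
        if cmd is Cmd.LOCKEM:
            self.mem[address] = old | 1   # write a ONE into the LSB
        return old                        # the old copy is sent back


em = EMModule()
em.execute(Cmd.STOREM, 10, 0b1000)
first = em.execute(Cmd.LOCKEM, 10)    # returns the word with LSB still 0
second = em.execute(Cmd.LOCKEM, 10)   # returns the word with LSB now 1
```

Because LOCKEM returns the old copy and sets the bit in one memory cycle, a requestor that reads back a ZERO in the least significant bit knows it was the first to set it, which is the behavior of a test-and-set primitive.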
5-36
All of the above commands may arrive at any CN clock cycle.
Except for the clocking, there is no synchronism imposed.
Figure 5.9 EM Module Block Diagram
(Diagram labels: memory chips, 64K words by 55 bits; MAR for
processor or coordinator; MAR for DBM; one-word buffer; control from
DBM controller; control from CN; EM No. wired into backplane;
parallel to byte-serial conversion.)
5-37
Table 5-9. Extended Memory Module (EM module) Characteristics

Number in system: 521 (No. of on-line spares: 4)

Function
Serves as main memory for array processor; serves as shared
memory among the processors.

Mode of Operation

Storage Capacities
65,536 words/module x 55 bits (48 data)
240 ns cycle
MOS dynamic RAM technology

Connectivities
To/From      Function or Name     No. Signals   Timing
CN           Data, Addresses,     24            20 ns per frame;
             Commands                           1st frame synch.
                                                to 120 ns clock
DBM cont.    Read, Write,         36            Clocked by CN clock
via EM       to DBM
fanout

Reliability/Repairability/Trustworthiness
All data is covered by SECDED. The generators and checkers are
contained in the elements that are the source and destination of
the data.
A parity checker checks parity on the module-number/address/
op-code fields received through the CN.

Physical
Projected chip count:   85 (55 memory chips)
Size:                   One 11" x 10" board
Additional constraints: Each EM module may be self-contained for
                        power regulation, just as is the processor,
                        to simplify power distribution.
5-38
5.6.3 DBM Interface
In addition to the above, there are two commands that result in
cycle-stealing for EM-DBM transfers. These commands and their
addresses come from the DBM controller:
(1) Read from address to one-word buffer, and
(2) Write to address from one-word buffer.
The one-word buffers are loaded from, or unloaded to, the data bus
to DBM under DBM controller control.
A transfer rate of 20 nanoseconds per word (50 million words per
second) is achieved on this bus. Every 20 nanoseconds, the
controls associated with this bus increment EM module number.
Decoding logic for this module number is found in the EM fanout
tree, where it is made conditional on the designation of spare EM
module. The EM address space has 512 words at each EM address to
simplify the address computations within the program. For writing,
the EM modules are cycled after 512 words are loaded into the
1-word buffers, and those EM modules whose buffers are flagged
"full" write, while the nine others do not. For reading, all 521
EM modules are caused to cycle, but only the 512 valid words at
this address-within-module will be transferred to DBM.
Incrementing of module number, for loading or unloading the 1-word
buffers, is done modulo 521. The address-within-module is
broadcast from the DBM controller, and is incremented every 512
words transferred.
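The sequencing of a DBM transfer can be sketched as follows. This is an illustrative model only: it assumes the transfer starts at a given linear word address, that module number and address-within-module follow the allocation rule of section 5.6, and that one word moves every 20 ns; the function name is hypothetical.

```python
NUM_MODULES = 521    # EM module number is incremented modulo 521
ROW_WIDTH = 512      # address-within-module advances every 512 words
WORD_TIME_NS = 20    # one word on the DBM data bus every 20 ns


def dbm_transfer_order(first_word, count):
    """Yield (module, address_within_module, time_ns) per word moved."""
    for n in range(count):
        address = first_word + n
        yield address % NUM_MODULES, address // ROW_WIDTH, n * WORD_TIME_NS


steps = list(dbm_transfer_order(0, 1024))          # two full rows
row1_modules = {m for m, row, _ in steps if row == 1}
idle_in_row1 = NUM_MODULES - len(row1_modules)     # modules that do not write
words_per_second = 1e9 / WORD_TIME_NS
```

In this model each group of 512 words touches 512 distinct modules and leaves nine idle, in line with the description of the write cycle above, and the bus rate works out to the 50 million words per second quoted in the text.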
5.6.4 EM Fanout
A second fanout tree, similar to that between the coordinator and
the processors, comes from the DBM controller and carries requests
for EM cycles from that controller.
It also carries EM addresses, and the two clock lines to the EM.
Because of the requirement for addresses, this one has
substantially more parts.
From the DBM controller comes address, command, clocks, and timing
for loading or unloading the one-word buffers in the EM module.
From the EM modules comes an "error" signal. Spares designation
is done by controlling processor access, not by switching EM
modules in and out, so no spares designation signals are in this
tree. Figure 5.10 shows the EM Fanout Tree. Table 5.10 summarizes
the characteristics of this Fanout Tree.
5-39
5.6.5 Design Rationale
Size of the EM module is in direct response to Ames' statements
about the size of the data base of the aero flow codes they expect
to run on the NASF. Speed of the EM module is derived from
observations about the number of EM accesses necessary to support
a given quantity of floating point operations in the processor.
The range of floating point operations per EM access was observed
to typically lie between 5 and 20 for the aero flow codes. The
resulting EM access times were seen not to impact the running time
of the entire aero flow codes, although some minor sections of
those codes were noticeably slowed by accessing EM, at the
currently designed speeds.
It should be noted here that advances in semiconductor memory
technology may make it feasible to consider use of 256-kilobit
chips instead of the current 64-kilobit chips. Also in the
future, 64-kilobit chips can be expected to be reasonably faster
than the current chips. Therefore, depending on when final design
decisions are made, a tradeoff could be made between the following
options:
(1) 256K words/module x 521 modules (large storage), or
(2) 64K words/module x 521 modules (current size but
faster).
The considerations will be that option (1) would have much larger
on-line storage with no impact on performance projections. Option
(2) assumes existing plans for data storage requirements, but the
faster parts would result in a faster system and increased
throughput (note that here one could consider fewer processors and
lower cost to get the requested throughput).
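The storage implied by the two options can be worked out directly. The sketch below is simple arithmetic on the figures given in this section; treating 512 words per address-within-module as the program-addressable width is a reading of section 5.6.3, and the function name is hypothetical.

```python
MODULES = 521      # on-line EM modules
ROW_WIDTH = 512    # words addressable at each address-within-module


def capacity_words(words_per_module):
    """Return (physical words, program-addressable words)."""
    physical = words_per_module * MODULES
    addressable = words_per_module * ROW_WIDTH
    return physical, addressable


option_1 = capacity_words(256 * 1024)   # 256K words/module
option_2 = capacity_words(64 * 1024)    # 64K words/module (current size)
growth = option_1[1] // option_2[1]     # ratio of addressable storage
```

With 64K-word modules this gives 34,144,256 physical words, of which 33,554,432 are addressable; the 256K-word option is four times larger on both counts.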
5.7 CONNECTION NETWORK (PROCESSORS TO EXTENDED MEMORY)
A flexible means of communication between the processors and the
Extended Memory modules is required. In order to achieve a
reasonable compromise between performance and hardware cost, the
connection network is based on the "Omega" network (ref 6) rather
than on the crossbar switch. The resulting network provides a
path from each processor to the EM module selected by that
processor. The network does not have a central, global control.
5-40
Figure 5.10 EM Fanout Tree Block Diagram
(Diagram labels: second-level fanout, 9 per cabinet, 36 required;
total 16 x 4 + 2 x 27 = 118 signals.)
5-41
Table 5-10. EM Fanout Characteristics

Number in System: 1

Function
Distribute addresses and commands from DBM controller to EM
modules. Distribute clock from DBM controller to EM modules.

Mode of Operation
Passive logic, no flip-flops, no execution of commands.

Storage Capacities
none

Connectivities
To/From     Function or Name      No. Signals   Comments
DBM cont.   Addresses, control    34            22 bits of address
Coord.      Clocks                2
EM mod.     All above             36            per module

Reliability/Repairability/Trustworthiness
Low parts count makes additional reliability precautions
unnecessary in comparison to the reliability of the rest of the
FMP.

Physical
Projected chip count: 1250 (of which 116 are hex buffers of one sort
                      or another)
Size:                 36 boards
5-42
The requirements put on the Connection Network are that it respond
immediately to connectivity requests (within tens of nanoseconds);
that it have on the order of N log N parts, as does the Omega or the
Benes network, instead of the N^2 parts of the crossbar switch; that,
like a crossbar, it be able to provide all N paths simultaneously
when the requests for connection are a p-ordered vector; and that it
be able to handle almost all N paths at once, with only modest delay
imposed on a few of the requests, when the requests do not form a
p-ordered vector. All of these can be accommodated in a design based
on the connectivity of the Omega network as shown in Fig. 5.11.
The network has been designed with the added capability of
processor to processor connection and provides transfer paths to
and from the coordinator. Although the path connectivity of the
network cannot be externally controlled, special communications
modes (such as "broadcast") are available under control of the
Coordinator.
The following discussion requires the use of certain definitions,
as follows:
A "p-ordered vector" is a set of requests in which the EM
module number being accessed by processor N is equal to (d +
pN) modulo 521, where d is called the "offset", and p is the
"skip distance". When p is also the distance between
successive addresses, p has also been called the "stride".
"Stride" modulo 521 equals "skip distance."
A "p-q-ordered vector" is defined in Appendix B, as a set of
requests from processors 0 through 511 such that processor
number i is requesting from memory module M_i given by M_i =
(a + p*i + q*((i-b)DIV k)) modulo 521. In this equation, k
is the length of each piece of vector, p is the skip distance
within each piece, and q is the additional skip distance
between pieces. The constant a is the offset. The constant
b is the amount by which the first piece is short, since the
first piece might be a leftover from some previous fetching
of a p-q-ordered vector. A simple example is shown in
Appendix B.
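The two definitions can be written out as short generator functions. This is a sketch under stated assumptions: the function names are hypothetical, and DIV is taken to be floor division (Python's `//`), which matters when i is less than b; the report's exact convention is given in Appendix B and may differ.

```python
NUM_MODULES = 521
NUM_PROCESSORS = 512


def p_ordered(d, p, n=NUM_PROCESSORS):
    """EM module requested by each processor N: (d + p*N) modulo 521."""
    return [(d + p * i) % NUM_MODULES for i in range(n)]


def p_q_ordered(a, p, q, b, k, n=NUM_PROCESSORS):
    """M_i = (a + p*i + q*((i-b) DIV k)) modulo 521.

    DIV is assumed to be floor division here.
    """
    return [(a + p * i + q * ((i - b) // k)) % NUM_MODULES for i in range(n)]


unit_skip = p_ordered(d=0, p=1)                 # modules 0, 1, 2, ...
skip_32 = p_ordered(d=5, p=32)                  # skip distance 32
pieces = p_q_ordered(a=0, p=1, q=4, b=0, k=8)   # pieces of 8, extra skip 4
```

Since 521 is prime, a p-ordered vector with any skip distance from 1 through 520 addresses 512 distinct EM modules, so such a request contains no module conflicts; whether the network can route all of them at once is a separate question.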
5.7.1 Functional Description
The Connection Network (CN) has two modes of control. First, in
the normal mode, the CN establishes connections to the Extended
Memory under control of the Processors. Second, the Coordinator
may use the Network for a number of special purposes as described
below.
5-43
Figure 5.11 16 x 16 Omega Network
(Diagram labels: port numbers 0 through 15 on the processor side and
port numbers 0 through 15 on the EM side.)
5-44
In the normal mode, a "request" establishes a two-way connection
between requesting processor and the requested EM module. The
establishment of the connection is acknowledged by the EM module.
The "acknowledge" is transmitted to the requestor. The release of
the connection is initiated by timing internal to the EM module.
Only one request at a time arrives at a given EM module. The CN,
not the EM module, resolves conflicting requests.
The following states of the connection network are established on
command from the coordinator.
(1) "Broadcast from coordinator". One word of data is
distributed from the coordinator to all processors.
(2) "Harvest to Coordinator". One word of data,
representing the AND or OR or some mixture thereof, of
the words presented by each of the enabled processors,
is received at the Coordinator. Expected to be used by
diagnostics with just one processor enabled.
(3) "Broadcast from EM". The EM module previously
identified by a request from the Coordinator, will have
the data being emitted by it broadcast to all
processors.
(4) "Wraparound at stage n". Each pair of processors whose
number differs by the bit at the nth bit position shall
be connected, and data shall be swapped between them
using the bidirectional path established. Processors
whose port numbers are separated by 2^n swap data.
(5) Diagnostic control
(6) "Null". Respond to processor requests normally.
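The pairing used by the "Wraparound at stage n" state can be illustrated briefly. The sketch assumes that bit position n is counted from the least significant bit starting at 0, so that partners differ by 2^n; the function names are hypothetical and the data are plain list entries indexed by port number.

```python
def wraparound_partner(port, n):
    """Port whose number differs from `port` only in bit position n."""
    return port ^ (1 << n)


def wraparound_swap(data, n):
    """Return the data after every pair of partners has swapped words."""
    return [data[wraparound_partner(i, n)] for i in range(len(data))]


words = list(range(8))
after_stage_0 = wraparound_swap(words, 0)   # neighbours swap
after_stage_2 = wraparound_swap(words, 2)   # ports 4 apart swap
```

Applying the same swap twice restores the original arrangement, since each port's partner's partner is the port itself.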
The connection network appears to be a dial-up network with up to
512 callers, the processors, possibly dialing at once. There are
512 processor ports, 521 EM module ports, and two coordinator
ports, one of which "looks like" a processor port, and the other
like an EM port. Processor ports, and the coordinator port, are
capable of accepting "requests".
The time required to set up each path is commensurate with the
access time of EM, which in turn is designed to be suitable for
the number of EM accesses observed in the applications studied.
In the CN design described in this section, the minimum time to
set up a connection is 120 ns. This time is achieved for most
cases, including specific cases that are important in the aero
flow and weather code applications studied.
5-45
5.7.2 CN Complexity Considerations
The basic Omega network provides only one possible path from a
given processor-side port to an EM-side port. A network of this
sort may experience blockage, especially during periods of heavy
simultaneous usage by all processors. A number of methods were
considered to reduce the probability of blockage and to increase
the effective throughput through the network. Three of these
methods will be summarized below.
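Blockage in a single-layer Omega network can be demonstrated with a small routing model. This is a sketch, not the report's simulator: it assumes the textbook shuffle-exchange structure with destination-tag routing (most significant destination bit first), grants requests greedily in ascending source order, and uses hypothetical function names.

```python
def omega_path(src, dst, n_bits):
    """Links occupied, one per stage, when routing src to dst."""
    mask = (1 << n_bits) - 1
    pos = src
    links = []
    for stage in range(n_bits):
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = ((pos << 1) & mask) | dst_bit   # shuffle, then switch output
        links.append((stage, pos))
    return links


def count_granted(requests, n_bits):
    """Number of requests routed without sharing any link."""
    busy = set()
    granted = 0
    for src in sorted(requests):
        path = omega_path(src, requests[src], n_bits)
        if busy.isdisjoint(path):   # blocked if any link is already taken
            busy.update(path)
            granted += 1
    return granted


def bit_reverse(value, n_bits):
    return int(format(value, "0{}b".format(n_bits))[::-1], 2)


N_BITS = 4   # a 16 x 16 network
shifted = count_granted({i: (i + 3) % 16 for i in range(16)}, N_BITS)
reversed_ = count_granted({i: bit_reverse(i, N_BITS) for i in range(16)}, N_BITS)
```

In this model a uniform shift passes with all 16 requests granted, while the bit-reversal permutation loses requests to blockage because sources that differ only in their top bit contend for the same first-stage link.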
The "natural" size (in terms of numbers of ports on each side) is
a power of 2. Since there are 521 + spares + Coordinator connec-
tions on the EM-side, the network can be considered to be a 1024 x
1024 network. This additional size is the first method of
reducing blockage. Half of the processor-side ports are unused
and slightly less than half of the EM-side ports are unused.
Thus, there is immediately a factor of two reduction in the
maximum number of requests for service to the network. By
spreading the active elements across all available ports,
potential blockage is further reduced by reducing the total number
of nodes in the network where blockage is physically possible, as
explained in section 5.7.3 below.
The second method, a simple duplexed network, requires
approximately twice the number of parts of the network just
described. In this case, the network is duplexed (i.e., there are
two copies) in order to provide alternate paths. Then requests
that may be blocked on one Omega network may find a path on the
second (which carries only those requests blocked on the first
"layer").
The duplexed network contains exactly twice as many 2 x 2 switch
nodes and twice as many node-to-node connections (one set on each
layer). In addition, a small amount of extra routing logic is
needed on the processor-side and a small amount of arbiter logic
is needed on the EM-side of the network.
A third method, a duplexed network with interlayer paths, has even
less blockage. In this method the total number of connections in
the network is the same as the second alternative just discussed.
The corresponding pair of 2 x 2 switch nodes (in the two Omega
networks or layers) is replaced by one 4 x 4 switch node. Connec-
tivity is provided between layers at each node, thus greatly
increasing the total number of possible paths from a processor-
side input to an EM-side output. The resulting network appears
the same as the Omega network (Fig. 5.11) but each connection
drawn actually is two independent connections and each node is a 4
x 4 switch rather than a 2 x 2 switch.
5-46
A threefold investigation has gone into the optimization of the
CN. First, a functional simulator was written, in which a variety
of test cases could be generated, and the resulting sets of
requests submitted to the simulated CN to observe the behaviour.
The processors in this simulation had a queue of up to five
requests each. The number of processors making a request could be
varied. There was provision to test 48 different CN design
options.
Second, a statistical evaluator was written, in which the
percentage of conflicts for random permutations on the inputs
could be computed for a variety of different CN design options.
For the CN option that they both handle, namely the single-layer
Omega network, the evaluator and the simulator give identical
results.
Third, an analytical evaluation of the CN behaviour, for
particular CN design options, was carried out. Each of these is
discussed in more detail in the Appendix B and Appendix H.
Either the simply duplexed network, or the duplexed network with
interlayer ports would be acceptable. The latter has the least
blockage, but a somewhat higher parts count. In the evaluations
made, both the simple duplexed network and the duplexed network
with interlayer paths had 100 percent success in fetching vectors
in two of three directions. The simple duplexed network had a
success rate of 77 percent in the third, or "hard" direction while
the other, more complex network had a success rate of 87 percent
in this case. (Success rate is defined to be the percentage of
requests which connect immediately to EM-side outputs with no
blockage. The experiments concerned had all processors active.)
In either design, if vectors with preferred skip distances are
presented to the network, 100 percent of the requests are
satisfied immediately. A skip distance of one is always satisfied
100 percent. Table 5.11 is based on the simple duplexed network.
5.7.3 Processor and EM Connection Mapping
The Connection Network has 1024 ports on the processor side,
numbered from 0 through 1023, and likewise on the EM module side.
Because potential blockage in the network is a function of
destination address and the origin of requests, the allocation of
processors and EM modules to ports of the network becomes an
important concern. The allocation function is called a mapping
function. The mapping function serves to map processor number
onto input port number and EM module number onto output port
number.
5-47
Table 5-11. Connection Network (CN) Characteristics

Number in System: 1

Function
To serve as a dial-up network whereby each processor can
access any EM module in a time comparable to the access time
of the EM module. To serve also as a broadcast network wherein
the coordinator or any EM module can broadcast to all
processors. To serve as the converse of broadcasting in
which the coordinator can harvest a single word from all
processors. To furnish some minimal processor-to-processor
communication.

Mode of Operation
Individual 2 x 2 switches are combined into a locally controlled
network. Control of the individual 2 x 2 node is generated
within itself from the signals presented to it, without
regard to the state of the rest of the network. There are no
latches or flip-flops within the CN; it is entirely
combinatorial logic.

Storage Capacities
(no entries)

Connectivities
To/From       Function or Name       No. Signals   Timing          Comments
Proc/coord    Data path, processor   24            20 ns/frame,    513 such
              side                                 120 ns major    connections
                                                   timing
EM mod./      Data path, EM side     24            same            522 such
coord                                                              connections
coord         Control                2

Reliability/Repairability/Trustworthiness
All data passing through the CN is covered by SECDED,
resulting in the correction of single transient errors, and
the detection of all hard errors.
The internal redundancy of paths will provide that function
continues for some, but not all, of the failure modes of the
CN.
5-48
Table 5-11. Connection Network (CN) Characteristics (Cont'd)

Physical
Projected chip count: 39280
Size:
Power:
5-49
Port = 32 x (EMno MOD 512) + 2 x (EMno MOD 4) + 1    for 512 <= EMno <= 527
which allows better distribution of the spare modules.
Within each cabinet, for the 256 ports in that cabinet, EM modules
are attached to all even ports 0, 2, 4, etc., through 254, and to
odd ports 1, 35, 69, and 103. In four cabinets, there are 512 +
16 ports thus addressable, allowing up to seven spares. Any spare
can be used in place of any failed EM module, up to four total
limited by the remapping in the CN buffer.
Furthermore, the remapping described above is done with simple
wired-in shifting, and ORing. The substitution of spare for bad
EM module is done by substituting one EM module number (521, 522,
523, or 524) for the EM module number of the failed module. The
conversion from EM module number to port number is fixed, mostly
just by wiring, in the CN buffer, as shown in Figure 5.12.
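The conversion from EM module number to CN port number can be sketched as below. The sketch uses Port = 2 x EMno for modules 0 through 511 and, for modules 512 through 527, the modified formula given in the report's errata, Port = 32 x (EMno MOD 512) + 2 x (EMno MOD 4) + 1. The function names and the dictionary used to represent spare substitution are assumptions made for illustration.

```python
def em_port(emno):
    """CN output port for an EM module number (errata formula)."""
    if 0 <= emno <= 511:
        return 2 * emno
    if 512 <= emno <= 527:
        return 32 * (emno % 512) + 2 * (emno % 4) + 1
    raise ValueError("EM module number out of range")


def cn_buffer_port(emno, substitutions):
    """Apply spare substitution (failed module -> spare), then map."""
    if len(substitutions) > 4:
        raise ValueError("at most four substitutions are provided for")
    return em_port(substitutions.get(emno, emno))


# Hypothetical example: module 17 has failed and spare 521 replaces it.
port_normal = cn_buffer_port(18, {17: 521})
port_spared = cn_buffer_port(17, {17: 521})
```

Modules 0 through 511 occupy the even ports and the higher-numbered modules occupy odd ports, so no two module numbers share a port. Only the module number is substituted; the port wiring itself never changes.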
5.7.4 Hardware Aspects
5.7.4.1 Clocks and Synchronization
Requests are made in synchronism with the "CN clock". The CN
clock is a submultiple of the processor clock. The CN clock will
be synchronous and simultaneous across all requesting ports (512
processors plus coordinator). The acknowledge from EM module is
received within a single CN clock period, since the CN clock
period is greater than the roundtrip delay through the network.
Since EM can be accessed only in synchronism with the CN clock,
the EM cycle time will be a multiple of the CN clock.
Processor clock is distributed in synchronism to all processors.
A signal which selects every mth processor clock pulse as the CN
clock is also distributed from the clock source, but the timing
reference is carried on the processor clock itself.
The values computed from projected characteristics are 40 ns for
processor clock period, 120 ns for CN clock period, five CN
clocks, or 600 ns, from the beginning of one request for read
access (LOADEM) to the beginning of the next request for read
access from the same processor, given that there are no blockages
in the CN itself. For store access to one EM module, the CN
buffer must wait 360 ns before accessing any other EM module, for
either read or store.
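The relationships among these timing figures can be checked with a few lines of arithmetic. The constants are the projected values quoted above; the per-processor and aggregate read rates are derived here purely for illustration and are not figures stated in the report.

```python
PROCESSOR_CLOCK_NS = 40
CN_CLOCK_NS = 120
READ_REQUEST_CN_CLOCKS = 5
STORE_WAIT_NS = 360
NUM_PROCESSORS = 512

# Every m-th processor clock pulse is selected as the CN clock.
m = CN_CLOCK_NS // PROCESSOR_CLOCK_NS

# Interval between successive read requests from one processor.
read_interval_ns = READ_REQUEST_CN_CLOCKS * CN_CLOCK_NS

# Store wait expressed in CN clocks.
store_wait_cn_clocks = STORE_WAIT_NS // CN_CLOCK_NS

# Derived upper bounds, assuming no blockage in the CN (illustrative).
reads_per_second_per_processor = 1e9 / read_interval_ns
aggregate_reads_per_second = NUM_PROCESSORS * reads_per_second_per_processor
```

With these values m is 3, the read interval is 600 ns and the store wait is 3 CN clocks, consistent with the text.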
5.7.4.2 Switch Element
Figure 5.13 shows the logic in one switch element. The control
logic occurs once and the sets of AND-OR gates are each repeated
twelve times as indicated on the diagram.
5-50
Figure 5.12 Mapping of EM Module Number to CN Output Port Number
(Diagram labels: EM No. bits in the CN buffer, wired to port No. bits
to the CN.)
5-51
Figure 5.13 CN Switch Element
(Diagram labels: processor-side ports 1 and 2; EM-side ports 1 and 2;
command input; control logic, which generates all signals labelled
"E".)
5-52
For the processors, several mappings have been tried or proposed:
1. Processors 0 through 511 attached to ports 0 through 511.
2. Same, except processor number is bit-for-bit the reversal
of port number. That is, processor number 110000000 is
attached to port number 000000011; processor 1 is
attached to port number 256; and so on.
3. Processors 0 through 511 connected to even numbered ports
0 through 1022.
4. Same as 3, except for the bit-for-bit reversal. That is,
processor 110000000 is attached to port number
0000000110; processor 1 is attached to port number 512,
and so on.
5. An assignment of processors to ports such that the
connectivities of the Omega network will make connection
cyclically among the processors, processor N being able
to transmit to processor N+1.
6. A random assignment of processors to ports.
6. A random assignment of processors to ports.
Similar assignments can be made on the EM module side, except that
the EM modules from number 512 to number 520 must be allocated
also.
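Mappings 1 through 4 are simple enough to state as functions. The sketch below is one reading of the list above: mapping 4 is taken to be the 9-bit reversal of the processor number followed by doubling, which reproduces the stated example of processor 1 on port 512. The function names are hypothetical.

```python
def bit_reverse(value, width):
    """Reverse the low `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def mapping_1(proc):
    return proc                      # ports 0 through 511


def mapping_2(proc):
    return bit_reverse(proc, 9)      # bit-for-bit reversal of mapping 1


def mapping_3(proc):
    return 2 * proc                  # even ports 0 through 1022


def mapping_4(proc):
    return 2 * bit_reverse(proc, 9)  # reversal, still on even ports


port_m2 = mapping_2(1)   # processor 1 under mapping 2 -> port 256
port_m4 = mapping_4(1)   # processor 1 under mapping 4 -> port 512
```

Mappings 3 and 4 both spread the 512 processors over all 1024 ports, using every second port, whereas mappings 1 and 2 confine them to ports 0 through 511.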
Mappings 1 and 2 can be eliminated by the observation that all the
Processors, or EM modules, are crowded up into one part of the
network, creating additional conflicts. This expectation is
validated by the results of the CN simulator using these mappings.
Mapping number 6 can be eliminated by the argument that other
mappings give much better results for the frequently used
p-ordered requests and p-q-ordered requests than they do for
random requests. The best operation seen with the simulator
suggests that mappings 3 and 4 should be used, one on either edge
of the network. The best case actually simulated was processors
using mapping 4 and EM modules using mapping 3 on the simple
duplexed network. Call this the "baseline" mapping function.
With the above choice of mapping functions, the known frequently
used requests are serviced with 100 percent or near-100 percent
request success, and random cases are serviced. The simple
duplexed network shows an average of 77 percent nonblocking in the
network for random requests. The duplexed network with interlayer
paths shows 87 percent nonblocking, and also represents the rate
of request success seen on random p-ordered vectors. Success rate
is 100 percent on requests with skip distance = 1. Success rate is
near 100 percent on p-q-ordered vectors with skip distance = 1
within the pieces of vector.
5-53
For any mapping, there is a bad case, a permutation in which only
32 out of the 512 accesses requested are granted per EM cycle. It
is desirable that this case be one that is not expected to occur
(note that the bad case is not a catastrophe; it is merely an
excess access time for one memory fetch). For mappings 3 and 4,
the bad case is when the EM accesses desired are the bit-for-bit
reversal of a sequential index. This case actually occurs once in
one of the several ways to program a fast Fourier transform.
Hence, investigation of mappings is expected to continue,
including mapping No. 5, which moves the bad case to some more
random permutation, and allows an interesting data exchange
pattern for the SHIFCN instruction. However, the Fast Fourier
transform, with one transform executed in parallel across the
array, does not occur in any aero flow code or weather code
evaluated. The FFT's in one weather code are executed serially,
512 FFT's in 512 processors, and do not contain the bit-for-bit
reversed subscripted parallel fetch request.
It might be noted here that requests within the Connection Network
refer to a CN port, not to a processor or EM module number. There-
fore all mapping must be done external to the CN. Mapping of a
processor number to a port has implications only for the wiring
pattern that is used to let each processor know its own number.
Off-line spare processors are inhibited from making requests to
anything other than spare EM modules. This is done in the CN
buffer logic of the processor. In addition, the CN buffer logic
is responsible for mapping EM module number to CN output port.
This implies that the provision for spare EM modules must be
accommodated in the remapping from EM module number to CN port
number, since the ports will not be physically relocated when an EM
module is spared. In every CN buffer, four port numbers will be
caught and replaced by substitute port numbers.
The suggested mapping from module number to port number is as
follows:
First, put the most significant bit of EM module number at the
least significant end of port number. This gives
     Port = 2 x EMno                     for 0 ≤ EMno ≤ 511
and would give
     Port = 2 x (EMno MOD 512) + 1       for 512 ≤ EMno ≤ 520
This last formula is unacceptable as it puts all nine high-order
EM modules into the first cabinet. Port numbers are rigidly
assigned to cabinets, one quarter to the cabinet. The second
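formula may be modified as follows:
     Port = 32 x (EMno MOD 512) + 2 x (EMno MOD 4) + 1     for 512 ≤ EMno ≤ 527
which allows better distribution of the spare modules.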
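The first, unmodified mapping is simply a one-bit rotation of the 10-bit EM module number. The sketch below checks that equivalence and lists the ports the nine high-order modules receive under it. The function names are hypothetical, and the 10-bit width is an assumption taken from the 0 to 520 range of module numbers.

```python
def port_first_formula(emno: int) -> int:
    """Port = 2 x EMno for 0..511, 2 x (EMno MOD 512) + 1 for 512..520."""
    if 0 <= emno <= 511:
        return 2 * emno
    if 512 <= emno <= 520:
        return 2 * (emno % 512) + 1
    raise ValueError("EM module number out of range")


def rotate_msb_to_lsb(emno: int, nbits: int = 10) -> int:
    """Move the most significant bit of an nbits-wide number to the bottom."""
    msb = (emno >> (nbits - 1)) & 1
    return ((emno << 1) & ((1 << nbits) - 1)) | msb


# Ports given to the nine high-order (spare) modules by the first formula.
spare_ports = [port_first_formula(m) for m in range(512, 521)]
print(spare_ports)
```

The spare modules land on the consecutive odd ports 1 through 17, which is the clustering the text objects to.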
The simple duplexed network would be packaged as follows:
The 512-wide, 10-deep by 2 layer arrangement of nodes can be
partitioned into 2-wide, 2-deep by 1 layer subsets in which every
subset is like every other subset. A 1-bit-wide slice of this
subset will fit in a 24-pin package as a single chip of moderate
complexity; 24 x 256 x 5 x 2 such chips will implement the entire
CN. This choice yields a total of 57,440 packages, all identical,
all in 24-pin packages. One observes that the use of the data
lines is half duplex, not full duplex. If bidirectional data
lines were used, a more complex chip, handling both directions of
data on the same line, would still have the same pin count. Strobe
and acknowledge, however, could not be combined. The result would
be 13 packages per node, instead of 25, and the total chip count
of 13 x 256 x 5 x 2 would be 39,280.
In a 40-pin package, the subset two nodes wide by two levels, and
both layers deep could be accommodated, so that exactly half as
many 40-pin packages would be used, or 28,720 packages without,
and 19,640 packages with, bidirectional data lines. In any of the
four cases, the control logic is replicated on each chip to reduce
pin count. The next-to-largest of these various projections is
used in Table 5.11 (which shows the CN Characteristics) to be
conservative without complete pessimism.
A complete new chip design is not planned. Rather a gate array
implementation is likely.
5.7.4.3 Packaging
Most of the CN is packaged within the EM cabinets, an identical
subset of the CN being found in each of the four cabinets. Note
that in Figure 5.11 the Omega network to the right of the second
level of logic is exhibited as four identical Omega networks of
one quarter the width. Thus, the 80 percent of the CN past the
first two levels of logic is found in the EM cabinets.
(If the processor cabinets had enough room, and if processor
numbers are assigned to cabinets in the correct pattern, the same
partitioning of 80 percent of the CN to processor cabinets can
also be achieved. An interesting puzzle is to devise those
assignments of processor number to cabinet that allow all of the
CN to be distributed among the processor and EM cabinets, with
none of the CN assembled in any one central location, such as
colocated with the coordinator and diagnostic controller.)
5.7.5 Design Rationale and Changes from Preliminary Study
The CN seen here represents a major change, and a major
improvement, over the transposition network described in Ref. 1
and Ref. 2. The transposition network was at its most efficient
only for 512-long vectors. For p-q-ordered vectors, the access
time went up proportional to the number of pieces into which the
vector had to be divided (five pieces for a 100 x 100 x 100
problem in the third, or hard direction). Conditional state-
ments within DOALLs resulted in complex code in those processors
that were trying not to execute anything; they had to pretend to
be fetching and storing to EM like other processors in order to
keep the synchronizations straight. Analysis in the compiler was
therefore also complex.
With this connection network all these complexities disappear.
Each processor is completely independent of any other processor.
The language has been simplified, since restrictions on
conditional statements and labels have been removed. The compiler
has been simplified, since the conditional LOADEM and STOREM
operations are no longer necessary, and a lot of address
calculation that took place at compile time, or which had to be
allowed for in the old control unit, is not needed with the
present connection network.
The CN chip count represents another cost/performance
optimization. For performance, a 516 x 5?8 crossbar switch, with
no conflicts, and all accesses being granted on the first attempt
at request, would be preferred. However, the crossbar switch has
275,088 crosspoints, whereas the CN has 40,960 crosspoints (four
in each 2 x 2 node). This is just 15.2 percent as many
crosspoints, reflecting a large ratio in hardware also. Despite
this huge hardware saving, the CN has 100 percent success in
fetching vectors in two of the three directions, and a success
rate of 77 percent (or 87 percent if the alternate design is
taken) in the third, or "hard" direction.
A second optimization of speed vs. hardware cost occurs in the
path width of the CN. At 11 bits per frame, we need a path that
is 12 signals wide, and takes five frame times to transfer a whole
word. At 20 ns per frame this means that the delay due to
serialization of the data word is an additional 80 ns, and
dividing address into two characters adds 20 ns. The delay due to
access time in EM is on the order of 200 ns (actually, it is yet
to be determined, and the recent TI announcement of a 64k-bit RAM
makes it appear that EM will be faster than projected in Ref. 1).
The delay due to round trip transfer time through the cables and
logic of the CN is estimated at 120 ns. Thus the 100 ns added
delay due to serialization of address and data is small compared
to the 320 ns or so minimum possible access time. In Reference 1,
a path width of 8 bits was chosen as adequate. This has been
increased to 11 bits in order to present the request in fully
parallel fashion; the request being the port number on the EM side
of the CN.
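The delay budget in this paragraph can be tallied as follows. This is only a restatement of the figures quoted above; the 55-bit word length is inferred from five frames of 11 bits, and the variable names are illustrative.

```python
import math

WORD_BITS = 55          # inferred: five frames of 11 bits each
BITS_PER_FRAME = 11
FRAME_NS = 20
ADDRESS_SPLIT_NS = 20   # address sent as two characters
EM_ACCESS_NS = 200      # "on the order of", still to be determined
CN_ROUND_TRIP_NS = 120

frames_per_word = math.ceil(WORD_BITS / BITS_PER_FRAME)
# Only the frames after the first one add delay relative to a fully
# parallel transfer of the word.
data_serialization_ns = (frames_per_word - 1) * FRAME_NS
added_delay_ns = data_serialization_ns + ADDRESS_SPLIT_NS
minimum_access_ns = EM_ACCESS_NS + CN_ROUND_TRIP_NS

print(frames_per_word, data_serialization_ns, added_delay_ns, minimum_access_ns)
```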
A third optimization concerns the time it takes to compute the
control of the CN. With an unlimited amount of time for computing
the setting, the Benes network can produce a set of paths such
that all processors have their requests granted immediately, 100
percent of the time. The Benes has fewer components than the CN.
Unfortunately, we are trying to make connections in nanoseconds.
Opferman and Tsao-Wu, Ref. 7, show that the amount of computation
required to find a non-blocking setting for a Benes network is on
the order of N^2 computational steps, or N log N if an associative
memory is available. This is certainly intolerable to compute at
run time when the data is being fetched, and in our opinion is
intolerable at compile time also. Furthermore, the computations
impose synchronization onto the processors, since one new request,
asynchronously added to an existing set of latched up requests,
requires a whole new control computation. Hence, we have opted to
search for suboptimum, but fast, control determinations, having
each node make its own determination of its own setting on the
basis of locally available information only, and ignoring the rest
of the CN.
5.8 DATA BASE MEMORY (DBM)
Data Base Memory (DBM) is the window in the computational envelope
of the FMP. All jobs to be run on the FMP are staged into DBM
before running, both program and data; all output from the FMP is
staged through the DBM. DBM can be used by the programmer to back
up EM for those problems whose data base is larger than EM.
Control of the data base memory is from a DBM controller
(described in the next section), which accepts commands both from
the coordinator for transfers between DBM and EM, and from the
host for transfers between DBM and the file system.
The design chosen is a CCD memory based on 256k-bit chips which
are projected to be available in the 1980 period. Alternatives
are discussed in the section on design rationale. Figure 5.14
shows the general organization of the Data Base Memory. The
details of this organization are discussed in more detail in the
next subsection.
The primary use of the DBM is as a staging area for jobs going to
and coming from the FMP. It can also be used as a source for
overlaying data and program into the FMP for large jobs. It is
possible to transfer less than a full block, but all transfers
must begin at a block boundary.
5.8.1 General Storage Characteristics
The general organization of the DBM is a controller together with
a general CCD chip array, used as the primary storage area, a
number of block-sized buffers, used for speed matching on data
transfer interfaces, and error controls.
The design described here is based on the assumption that 256k-bit
CCD chips will be arranged in the form of 128 shift registers of
2,048 bits each. It is also assumed that the shift rate of the
devices will be 2.5 MHz.
[Figure 5.14 Data Base Memory Block Diagram. Blocks shown: four
10 MHz channels to the file system; two 64K-word buffer memories;
the CCD chip array, 440 wide by 64 chips deep; a 440-bit data
register; a parallel data bus (55 bits at 20 ns or 110 bits at
40 ns) with SECDED error correction on the path to and from EM;
and the DBM controller, which exchanges I/O descriptors and result
descriptors with the file system, and EM access requests, results,
and DBM-EM transfers with the coordinator.]
DBM files come from and are moved to the SPS file management
system. Over 99 percent of this traffic is expected to be simple
moves from DBM to disk pack. Twenty M-bits/sec on this path
yields large safety factors over the traffic actually required,
even after making allowance for the fact that short jobs will be
bunched in prime time. The four channels provide 20 Mbits/sec
with 5 MHz disk transfer rates and 40 Mbits/sec with the 10 MHz
rates available in recently announced products.
No buffering is needed on the EM side beyond the one-word buffers
in each EM module. These one-word buffers, and the 240 ns cycle
time of the EM modules, together ensure that the DBM controller
never need wait for an EM response.
DBM-EM transfers have priority over CN servicing in the EM
controls. However, there is little interference with processor
accesses to EM. For example, when transferring from EM to DBM,
one EM cycle loads 512 of the per-EM-module one-word buffers, and
then waits for 12.8 microseconds before another EM cycle is
required for the DBM transfer path.
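The 12.8 microsecond interval follows from the 40 M words/sec DBM-EM transfer rate quoted elsewhere in this section. A short check, which also estimates the fraction of EM time taken by the transfer under the assumption that each load costs one 240 ns EM cycle (variable names are illustrative):

```python
WORDS_PER_EM_CYCLE = 512          # one word from each of 512 EM modules
DBM_EM_RATE_WORDS_PER_US = 40     # 40 M words/sec
EM_CYCLE_NS = 240

# Time the DBM path needs to drain the 512 one-word buffers.
interval_us = WORDS_PER_EM_CYCLE / DBM_EM_RATE_WORDS_PER_US
# Share of EM cycles consumed by the transfer (assumes one EM cycle
# per 512-word load and nothing else).
em_busy_fraction = EM_CYCLE_NS / (interval_us * 1000)

print(interval_us, round(em_busy_fraction * 100, 2))
```

On these figures the DBM transfer occupies just under 2 percent of the EM cycles, consistent with the statement that interference with processor accesses is small.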
Table 5.12 summarizes the characteristics of the DBM.
5.8.2 Soft Error Control
As a background job, the DBM controller periodically initiates an
access for the purpose of reading the contents of a block and
rewriting that same block with all detectable errors corrected,
since errors are spontaneously created in CCD memories at a low
rate. These errors are apparently caused by background radiation
effects on the CCD chips, discharging the little capacitors by
temporarily ionizing the oxide. The rate of periodically
initiating access can rationally be determined only after getting
the vendor's specification. Preliminary Fairchild data indicates
that one should scrub through the entire DBM every seven minutes.
At that rate, this background access would be initiated for a new
block every 55 ms. Error scrubbing accesses will not queue. If
one is delayed beyond its 55 ms time slot, then the whole cycle
will slip to 7 minutes plus 55 ms.
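The relation between the scrub period and the per-block time slot can be checked against the 8k-block organization described in this section. The sketch assumes one scrub access per block and exactly 8,192 blocks; on that basis the 55 ms slot and the seven-minute pass agree only approximately.

```python
BLOCKS = 8 * 1024            # "8k blocks"
SCRUB_PERIOD_S = 7 * 60      # one full pass every seven minutes
QUOTED_SLOT_S = 0.055        # 55 ms per block, as quoted

# Slot length implied by a seven-minute pass.
slot_ms = SCRUB_PERIOD_S / BLOCKS * 1000
# Length of a full pass if each block really gets a 55 ms slot.
pass_minutes_at_quoted_slot = BLOCKS * QUOTED_SLOT_S / 60

print(round(slot_ms, 1), round(pass_minutes_at_quoted_slot, 2))
```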
5.8.3 Design Rationale and Changes from Preliminary Study
The major change from Ref. 1 and Ref. 2 was the reorganization of the
internal structure of the DBM CCD storage array to allow higher
bandwidths to and from the EM modules and to and from the file
storage system.
Table 5-12. Data Base Memory (DBM) Characteristics

Number in System: 1

Function
  To serve as staging area for FMP jobs; to serve as memory
  extension for FMP jobs that will not fit into EM and PMs.

Mode of Operation
  Stores in blocks only. Has access to support processor file
  system on the one side, and to the EM on the other side. DBM
  areas may be used by the file system.

Storage Capacities
  134,217,728 words                    131,072 words
  400 ns cycle shift rate      plus    280 ns cycle
  256k-bit CCD technology              64k-bit RAM, dynamic MOS

Connectivities
  To/From             Function or Name   No. Signals   Timing            Comments
  Support Processor   Data channel       TBD           40 megabits/sec
  EM                  Data channel       TBD           40 megabits/sec
  DBM controller      Control            TBD           TBD

Reliability, Repairability, Trustworthiness
  All words covered by error correcting code.
  Errors are periodically removed by reading, doing error
  correction, and rewriting.
  Sections of DBM can be locked out by software, so that
  function can be provided by the remaining working portions.

Physical
  Projected chip count:  29160 (28160 memory + 1000 control and misc.)
  Size:                  176 boards of 166 chips each
  Power:                 10 kw operating, 1 kw standby
Two major data transfer paths exist, one to the EM and one to the
disks of the File System. The desired transfer rate to and from
the Extended Memory (EM) is 40 M words/sec. To accomplish this,
the DBM storage area will be organized 440 chips wide (for parallel
emission of eight 55-bit words) by 64 chips deep.
The natural block size with 2,048 bits in each shift register, the
eight words in parallel delivering a block of 16,384 words, is
adopted. There are 8k blocks for a total of 134,217,728 words.
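The capacity figures can be rebuilt from the chip organization given in 5.8.1. The sketch assumes that a block occupies one shift-register position across a 440-chip row, so that the number of blocks is the number of registers per chip times the number of rows; that reading is an inference, chosen because it reproduces the quoted 8k blocks.

```python
BITS_PER_REGISTER = 2048
REGISTERS_PER_CHIP = 128     # 128 x 2,048 bits = 256k bits per chip
CHIPS_WIDE = 440
CHIPS_DEEP = 64
WORD_BITS = 55
SHIFT_PERIOD_NS = 400        # 2.5 MHz shift rate

words_in_parallel = CHIPS_WIDE // WORD_BITS               # eight words per shift
words_per_block = BITS_PER_REGISTER * words_in_parallel   # one register length
blocks = REGISTERS_PER_CHIP * CHIPS_DEEP                  # assumed block layout
total_words = words_per_block * blocks
memory_chips = CHIPS_WIDE * CHIPS_DEEP
block_transfer_ms = BITS_PER_REGISTER * SHIFT_PERIOD_NS / 1e6

print(words_per_block, blocks, total_words, memory_chips, block_transfer_ms)
```

The result matches the 16,384-word block, the 134,217,728-word total, the 28,160 memory chips of Table 5-12, and a block transfer time of about 0.82 ms.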
Error correction is a SECDED, probably the modified
Hamming-plus-parity implemented by Motorola's 10,163 chip.
Since the array of CCD chips is 64 x 440, the DBM is constructed
in a number of physical modules; each one 8 x 440 chips. Cards
are 20 bits wide, 22 cards per module. The repair philosophy is
to pull and replace individual cards, and the degraded mode of
operation would be to run with one or more modules missing, and
the operating system would have to be told to avoid assigning any
data to that space.
There are eight block-sized buffers, which stand between the CCD
storage and the host interface, in order to reduce the
interference with DBM-EM transfers produced by simultaneous
DBM-file system transfers. They also serve as timing buffers to
the file system's disk packs, eliminating the need for block sized
buffers elsewhere in the data channel. These buffers are
contained in two memory modules constructed of the 64k-bit dynamic
RAM chips used in the EM modules.
After the transfer of a block to or from the CCD store, the shift
registers rest at the starting position until shifting is required
by the refresh requirements, or until the CCD store is again
addressed, whichever occurs first. The store will be periodically
addressed for error control reasons (see 5.8.2). Therefore,
whenever there are several requests for transfer pending at once,
or when they occur with sufficient frequency, the access time is
essentially zero to the first word of the block. For transfers
arriving at random times, far enough apart in time so as not to
interfere, the average access time is given by:
     T_av = (1/2) (T_b^2 / T_r)
where T_b is the transfer time of a single block (0.82 ms) and T_r
is the time between refreshes. T_r will be in the specification of
the device, and is expected to lie between 1 ms and 10 ms. Therefore,
the average access time for random data at low usage, to the
first word of the block, has an upper bound which is expected to
lie between 0.67 ms and 0.067 ms. As traffic increases, the
access time is mostly due to interference between competing acces-
ses, while the contribution due to delay in the memory goes to
zero.
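The formula can be evaluated at the two ends of the expected refresh interval. In the sketch below the function name is hypothetical. Note that the quoted bounds of 0.67 ms and 0.067 ms equal T_b^2 / T_r, that is, twice the average given by the formula.

```python
def average_access_ms(tb_ms: float, tr_ms: float) -> float:
    """Average access time T_av = (1/2) * T_b**2 / T_r, in milliseconds."""
    return 0.5 * tb_ms ** 2 / tr_ms


TB_MS = 0.82  # transfer time of a single block

for tr_ms in (1.0, 10.0):
    avg = average_access_ms(TB_MS, tr_ms)
    bound = TB_MS ** 2 / tr_ms   # the value matching the quoted bounds
    print(tr_ms, round(avg, 4), round(bound, 4))
```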
The DBM design seen here is the result of comparing a number of
different devices. The other possibilities include:
- Magnetic bubbles. Rejected because the bandwidths would
  require the reading and writing of thousands of bubble chips
  in parallel, and also because of the inherently greater
  complexity of bubble systems. Each bubble chip requires
  several support chips such as drivers, sense amplifiers, etc.
- Rotating magnetic storage. With enough heads in parallel,
  and a fast enough rotation rate, magnetic rotating storage
  can supply the DBM requirements. However, the programming
  becomes complicated by considerations of data organization
  and access time. Blocks want to be very large to amortize
  the large access times over the high transfer rate
  requirements. For example, to get full transfer rate from a
  10 ms disk requires blocks that cover the entire track, or
  blocks 10 ms long. If full transfer rate is 40 million words
  per second, the blocks are almost half a million words each.
  For some purposes this is a severe restriction.
- 64k-bit CCDs. 64k-bit dynamic RAMs will be preferred by
  almost all equipment designers over the shift register CCDs.
  With the recent appearance on the market of dynamic RAMs, it
  is to be expected that the 64k-bit CCDs will disappear.
- 64k-bit dynamic RAMs. These would make an acceptable back-up
  DBM design. With the increased cost would come a measure of
  increased performance and freedom from the hardware-defined
  block structure.
One last possibility should be mentioned for the future. The same
device fabrication, tooling and lithography techniques which are
expected to allow the development of 256k-bit CCD chips can be
expected to result in 256k-bit dynamic MOS RAM chips within a year
after the CCD chips are available. Enough advantages may accrue
from the use of these chips in terms of increased performance and
freedom from a fixed, hardware-defined block structure that these
RAM chips would be used in the DBM design.
5.9 DATA BASE MEMORY (DBM) CONTROLLER
The DBM controller interfaces two environments, the FMP internal
environment and the file system, since the DBM is the window in
the computational envelope. DBM allocation is under the control
of the file management function of the support processor. The DBM
controller has a table of that allocation, which allows the DBM
controller to convert names of files into DBM addresses. When the
file has been opened by an FMP program, it is frozen as far as
allocation is concerned, and must remain resident in DBM until
either closed or abandoned. For open files, the DBM controller
accepts descriptors from the coordinator which call for transfers
between DBM and EM. These descriptors contain absolute EM
addresses, but file names and record numbers for the DBM contents.
The DBM controller therefore has two main elements: first, a
programmable controller, and second, hardwired channel logic to
accommodate the data transfers.
The software response time of the DBM controller shall be less
than 100 microseconds to Coordinator requests. This demands that
the conversion from file name to address be simple table lookup,
and also that the response of the DBM controller to Coordinator
commands be essentially instantaneous; i.e., either the normal
state of the DBM controller is waiting for a Coordinator command,
or Coordinator commands have a priority interrupt within the DBM
controller.
The actual channel controls for transferring a block of data are
independent of the controller that does the table lookup and
handles the exception conditions. Address counters, limit
registers, and limit comparators are separately implemented, not
programmed, because of the high transfer rates involved. There
are five such channel controls, one per host channel, and one for
the EM interface. The entire bandwidth of the EM channel is
devoted to whatever single transfer is being effected at a given
time.
Operation is as follows. When an FMP task has been requested, the
support processor passes to the file manager the names of the
files needed to start that task. In some cases existing files are
copied into newly named files for the task. When all files have
been moved into DBM, the task starts in the FMP. When the task in
the FMP opens any of these files, the allocation will be frozen
within DBM. It is expected that "typical" task execution will
start by opening all necessary files. During the running of an FMP
task, other file operations may be requested by the user program
on the FMP, such as creating new files and closing files.
EM space is allocated either at compile time or dynamically during
the run. In either case, EM addresses are known to the user
program. DBM space, on the other hand, is allocated by the file
manager, which gives a map of DBM space to the DBM controller. In
asking the DBM controller to pass a certain amount of data from
DBM to EM, the Coordinator, as part of the user program, issues a
descriptor to the DBM controller which contains the name of the
DBM area, the absolute address of the EM area, and the size. The
DBM controller changes the name to an address in DBM. If that
name does not correspond to an address in DBM, an interrupt goes
back to the Coordinator, together with a result descriptor
describing the status of the failed attempt.
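The descriptor handling described above amounts to a table lookup followed by either a transfer or a failure report. The following is a minimal sketch of that flow, not the controller's actual design: every class, field, and status string is hypothetical, and the size check against the allocated area is an added assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TransferDescriptor:
    """Request issued by the Coordinator (field names are illustrative)."""
    dbm_name: str      # name of the DBM area
    em_address: int    # absolute EM address
    size_words: int    # amount of data to move


@dataclass
class ResultDescriptor:
    ok: bool
    status: str
    dbm_address: Optional[int] = None


class DBMControllerModel:
    """Toy model of the name-to-address step only."""

    def __init__(self, allocation: Dict[str, Tuple[int, int]]):
        # name -> (DBM start address, length in words); in the text this
        # map is supplied by the file manager.
        self.allocation = allocation

    def handle(self, desc: TransferDescriptor) -> ResultDescriptor:
        entry = self.allocation.get(desc.dbm_name)  # simple table lookup
        if entry is None:
            # In the text: interrupt to the Coordinator plus a result
            # descriptor describing the failed attempt.
            return ResultDescriptor(False, "name not in DBM")
        start, length = entry
        if desc.size_words > length:  # assumed limit check
            return ResultDescriptor(False, "transfer exceeds area")
        return ResultDescriptor(True, "transfer started", start)


controller = DBMControllerModel({"GRID": (0, 16384)})
print(controller.handle(TransferDescriptor("GRID", 4096, 16384)))
print(controller.handle(TransferDescriptor("MISSING", 0, 1)))
```

Keeping the lookup a plain dictionary access mirrors the requirement that name-to-address conversion be a simple table lookup.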
Not all files will wait to the end of an FMP run to be unloaded.
For example, the number of snapshot dumps required may be data
dependent, so we may wish to create a new file for each one, and
certainly we shall want to close the file containing a snapshot
dump so that the file manager can unload it from DBM. When the FMP
task terminates normally, all files that should be saved will have
been closed by the FMP program. The strategy that supports
restart has not been detailed.
The file manager may choose to leave read-only files in place in
DBM, on the chance that the same read-only file may be asked for
by more than one task.
5.10 DIAGNOSTIC CONTROLLER (DC)
The diagnostic controller provides a channel whereby the Support
Processor, or a logician at the maintenance panel, can impose
diagnostics upon the FMP. The strategy behind the diagnostics is
that any portion of the FMP can be set to some arbitrary state,
and then caused to execute some fixed function or execute for some
fixed amount of time, and that the resulting state can be
observed. The Diagnostic Controller's access is direct to the
coordinator. Access to the processors is indirect, in that the
coordinator has direct access to the processors, and the diagnos-
tic controller manipulates the coordinator. Chapter .5, in dis-
cussing the diagnostic programming, discusses these relationships
in more detail.
The output of the diagnostic controller is a set of commands to
the Coordinator and the DBM controller. These commands are yet to
be determined in detail but they are of the general type of the
following examples:
LOAD REGISTER R
READ REGISTER R
EXECUTE the instruction presently residing in the program
register and then halt
HALT all operations, possibly by suspending the clock
INITIALIZE a predetermined subset of registers to a
predetermined state (probably all zeroes)
The input of the diagnostic controller comes from either the
support processor or from a maintenance terminal. The input can
cause the diagnostic controller to emit single commands, or to
emit a series of preprogrammed commands. In order to emit meaningful
sequences, and to collect the results of those sequences, it
is envisioned that the diagnostic controller contains a mini or
microprocessor. A test control language will be provided.
The diagnostic controller is a debugging aid, a system integration
aid, and is used only as a fall-back mode of operation during
maintenance. System initialization, upon power up or other cold
start of the FMP may also use some of the DC capabilities for
initialization of selected registers and loading of bootstraps.
5.11 POWER CONSIDERATIONS
The power supply design for the FMP will consider the following:
- A small number of centralized power conditioning modules
that accept raw AC power from the mains.
- Switching regulators for efficiency
- Defense against faults in the incoming power
- Defense against faults in the FMP
- Noise reducing grounding methods.
- Non-volatility of DBM contents
A power supply system that takes all these features into account
is described in this section.
Total power for the FMP is estimated at 250 kw, based on an
average of 0.8w for each of the 200,000 circuit packages and on 65
percent efficiency in the power supply system.
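The 250 kw estimate is the package load divided by the supply efficiency, rounded up. A short check using only the figures quoted above (variable names are illustrative):

```python
PACKAGES = 200_000
WATTS_PER_PACKAGE = 0.8
SUPPLY_EFFICIENCY = 0.65

load_kw = PACKAGES * WATTS_PER_PACKAGE / 1000   # power delivered to the logic
input_kw = load_kw / SUPPLY_EFFICIENCY          # power drawn from the mains

print(round(load_kw), round(input_kw))
```

This gives about 160 kw delivered to the packages and about 246 kw drawn from the mains.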
5.11.1 AC Modules
The block diagram of the power supply system is shown in Figure
5.15. Raw AC power is supplied to six places (labelled "1" in the
figure), namely:
- The maintenance panel, which also contains the central
power system control
- The DBM power system
- Four identical AC modules.
Each of the six places to which raw AC input is supplied contains
an AC voltage monitor. The design intent is to shut the machine
down for high line or low line that is potentially damaging to the
machine, and to send a one-bit message to the maintenance panel
and the support processor for low-line conditions that may cause
garbling of data.
[Figure 5.15 FMP Power System. Blocks shown: AC input (three-phase,
208V phase to phase); AC modules No. 1 through No. 4; DBM power;
central power control; processor cabinets No. 1 through No. 4; EM
cabinets No. 1 through No. 4; DBM cabinet; DBM controller cabinet;
coordinator and CN cabinet; maintenance display.]
Out of the maintenance panel's power system come various DC
voltages (labelled "3") for the maintenance display and the
central power control. These include +5 at 20 amps for logic and
LED drivers, +12 at 0.5a, and -12 at 1a, plus a switched single-phase
115v AC for the CRT, which has its own self-contained supply.
The AC modules receive "turn-on", "turn-off" signals from the
central power system control, and send "fault" signals back to the
central control. Each AC module supplies up to 250 a at 158 volts
DC (labelled "2") for the switching regulators attached to it.
The AC module also combines the fault signals from its attached
power supplies into a "cabinet fault" line, and shuts down for any
perceived faults. It contains line filters. In complexity, it is
similar to the AC module of the B-6700. Power efficiency is
between 96 percent and 99 percent.
The requirement that the FMP power system ride through
undervoltage transients, and tolerate voltage spikes from the
mains, influences the design of the power control modules. A
transformerless rectifier in the central power control module,
with switching regulators distributed around the FMP, is a system
inherently tolerant to undervoltage sags and transients, and
impervious to spikes. In addition, the switching transients of
the regulators tend to be soaked up by the filter capacitors at
the control module's rectifier. Whether either a motor-generator
set or battery backup is needed, would depend on actual line
characteristics at Ames. If the line characteristics are known
before the design is carried out, the system can be designed so
that the expense and inefficiency of the motor-generator set can
be eliminated.
The DBM power unit provides 30 amps at 158v for the DBM controller
logic supplies (labelled "5"), and a separate line ("4") for 35
amps at 158v for the memory chips of the DBM. There is also a
stand-by mode in which 8 amps at 158v is supplied to the memory
cabinet from batteries during power outages of up to 15 minutes.
(15 minutes is selected on the basis that that is long enough to
save all of DBM on a single disk pack through a single disk
channel. The resulting 316 watt-hour requirement can possibly be
supplied by an ordinary sealed lead-acid battery.) The DBM power
unit also contains logic to handle fault situations, and the same
line filters that are in the AC modules.
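The 316 watt-hour battery requirement follows directly from the stand-by figures quoted above, as this short check shows (variable names are illustrative):

```python
STANDBY_AMPS = 8
BUS_VOLTS = 158
OUTAGE_MINUTES = 15

standby_watts = STANDBY_AMPS * BUS_VOLTS
energy_wh = standby_watts * OUTAGE_MINUTES / 60

print(standby_watts, energy_wh)
```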
5.11.2 Other Power Supplies
Besides the seven units described briefly above, there are within
the cabinets the following:
- 516 processor power supplies, each contained physically
  within its own processor. Each one is a 70 percent or
  better efficient switching regulator supplying 40 amps at
  5v, and 0.5 amps at -12v.
- 44 supplies at 5v, 160 amps. Except for the lower power
  level, these are similar to power supplies built by
  Burroughs for PEPE. There are eight in each EM cabinet,
  and four in the Coordinator-connection-network cabinet,
  and eight each in the DBM controller and the DBM memory
  cabinet. These are switching supplies at 75 percent
  efficiency.
- 6 supplies at 12v, 2 also working from the 158v out of the
  AC modules. These are used in Coordinator, DBM
  controller, and EM cabinets for +12v and -12v for various
  purposes.
Each of the supplies above contains remote voltage sensing,
appropriate over-current sensing, current limiting or fold-back,
over-voltage and under-voltage sensing.
5.11.3 Grounding Considerations
Grounding is an area of design in which even qualified electrical
and electronic engineers sometimes propagate false myths. Some of
the confusion is due to failing to distinguish between various
functions of the conductors called "ground", which in any given
case may or may not be at the same voltage, and may or may not be
the same conductor. Some functions are:
- Neutral in an AC distribution system.
- Earth, or an external zero voltage reference.
- Safety ground, enclosing the equipment in order to prevent
shock hazards.
- Shields, enclosing electrically active circuits in order to
prevent transmission or reception of interfering electromagnetic
signals.
- Reference voltage. The signal voltages in the equipment are
measured with respect to the reference voltage. Reference
voltage is often called "logic ground".
- Return paths for currents.
Some details resulting from these considerations are:
- The ground return from backplane to power supply is never
used as part of the path that connects one backplane ground
to another backplane ground.
- Every module has its logic ground tied to chassis, so that
there will be no floating grounds when the modules are
tested as stand-alone modules. These ties may be resistors
if unwanted ground currents would be set up by direct
connections.
- Every single-ended signal which traverses from the area of
one backplane to another is accompanied by a wire conductor
for the return current of that signal, and the return conductor
is connected to reference voltage at all points at
which the signal is either generated or used.
5.12 CIRCUIT AND PACKAGING TECHNOLOGY
5.12.1 IMPLEMENTATION TECHNOLOGY UPDATE
5.12.1.1 SUMMARY
The semiconductor industry has continued to improve both device
density and performance since the previous implementation technol-
ogy submission. Smaller device geometries have been achieved in
production with the application of Electron Beam processing tech-
niques. The initial utilization of the E Beam tool has been in
the mask generation area where smaller geometry and more rapidly
generated masks have been produced. This advantage coupled with
projection exposure of wafers as compared to the use of contact
masks and plasma or dry etching has enabled higher precision
devices to be generated in a production environment. Line widths
are predicted to diminish to under one micron. The priority of
devices to which the new processing technology is applied has been
first in the memory area and second in the microprocessor area.
Microprocessor availability in the 16 bit logic density area has
increased from just a few, to a selection of a half dozen or so
with performance estimated to be in the PDP 11/45 class or
greater. Direct address capability has expanded from a 16 bit
limitation of 65K to a 16 megabyte level.
During the initial manufacture of large (65K) CCD memories a
higher than expected random failure rate was observed. The
failure mechanism was later identified as being caused by alpha
particles which modified the charge being transported, thus
destroying the information stored in the memory. Solutions were
developed for greatly lowering this failure rate by reducing or
eliminating the major source of alpha particles and providing a
shield layer on the chip. The major source of alpha particles was
reported to be in material used to package the CCD chips.
In the very high speed area, gate arrays were becoming an
interesting alternative for achievement of dense logic implemen-
tation. The economy of using gate arrays is dependent on quantity
of the devices required for the systems to be produced. Basic
arrays exist at Motorola and Fairchild in the high speed ECL area.
A gate array exists in the proprietary Burroughs CML circuit
family (BCML).
Memories anticipated to be available in the 1979/1980 time frame
have already been delivered on a sample basis to selected manu-
facturers. These include the 65K dynamic RAMS and 16K static
RAMS. CCD 65K bit memory circuits have been delivered for incor-
poration into CCD memory modules. Work has begun in definition of
256K bit CCD and 256K bit RAM with expectations of availability in
the 1980/81 time frame.
Gallium arsenide efforts in the high speed sub-nanosecond logic
area have continued at a number of manufacturers' facilities.
Specification circuit configurations for the gallium arsenide
MESFETS are being reviewed along with development of production
procedures to manufacture these devices. Speed-power estimates
vary from about 100 picoseconds to about 300 picoseconds in
propagation delay, with power dissipations varying from about
0.08 to 0.3 milliwatts per gate.
Effort is being expended in utilization of the CMOS SOS type of
circuit implementation. At the Solid State Circuits Conference in
1978 the general discussion seemed to indicate that CMOS SOS gate
density problems would be somewhat overcome with the tighter line
width. The attractiveness of the CMOS SOS circuit for NASF
applications is the projected lower power dissipation of gates,
not memory, in the CMOS LSI circuit.
The specific implementation approach to be selected for the NASF
FMP must be postponed as long as possible due to the dynamic
developments occurring in the semiconductor technology area. At
present, the bipolar ECL or CML candidates look the most promising
from a performance/risk point of view. Although developments in
higher speed gallium arsenide devices are progressing, the risk
involved in such an infant technology does not seem to be
warranted by the advantages gained in higher speed.
During the current contract some additional information in both
ECL arrays and BCML circuits has been reviewed. Some
characteristics of the ECL voltage compensated arrays as well as
information on BCML are included in the following.
5.12.1.2 ECL Arrays
Motorola has announced the MECL 10K Macrocell Array that consists
of 48 macro cells with 32 interface circuits and 28 output
circuits. All cells can have series gating. Structured cells are
predefined into logic elements. Interconnect channels are 12 x 12
for 9 macro cells. The macro cells consist of functional circuits
which are interconnected to produce larger portions of logic. The
total number of channels available for interconnection is
approximately 108 x 94. Internal gate delays anticipated are
approximately 900 picoseconds. A maximum of 1.3 nanoseconds is
expected. The maximum power dissipation if all cells are used is
anticipated to be approximately 4 watts. An equivalent gate
complexity up to 750 gates can be realized on the array. The
average gate power is projected to be 5.3 milliwatts. High drive
outputs can be achieved at 8 of the interfaces. A capability of
driving a 25 ohm line exists at these points. The die size is
approximately 210 x 230 mils. It is anticipated that the
semiconductor chips will be placed in a 68 pin leadless package.
Some proposed connectors exist for the 68 pin package.
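The projected figures above can be cross-checked with simple arithmetic. The following is a minimal sketch in Python; the names are illustrative and the speed-power product is a derived quantity, not a figure quoted for the array. It divides the maximum array dissipation by the equivalent gate count and multiplies the result by the anticipated gate delay.

```python
# Cross-check of the Macrocell Array figures quoted in the text.
MAX_POWER_W = 4.0          # maximum dissipation with all cells used
EQUIVALENT_GATES = 750     # maximum equivalent gate complexity
TYPICAL_DELAY_S = 900e-12  # anticipated internal gate delay


def average_gate_power_mw(power_w: float, gates: int) -> float:
    """Average power per equivalent gate, in milliwatts."""
    return power_w / gates * 1e3


def speed_power_product_pj(power_mw: float, delay_s: float) -> float:
    """Speed-power product per gate, in picojoules (derived, not quoted)."""
    return power_mw * 1e-3 * delay_s * 1e12


gate_mw = average_gate_power_mw(MAX_POWER_W, EQUIVALENT_GATES)
product_pj = speed_power_product_pj(gate_mw, TYPICAL_DELAY_S)
print(f"average gate power:  {gate_mw:.1f} mW")
print(f"speed-power product: {product_pj:.1f} pJ")
```

The first result agrees with the 5.3 milliwatt projection; the second applies only if every equivalent gate is used and runs at the typical delay.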
5.12.1.3 BCML
The BCML-2 (Burroughs' Current Mode Logic) family is a Burroughs
developed circuit family intended for use in Burroughs' systems.
The family consists of SSI, MSI and LSI circuit types, gate
arrays, register files, ROM, PROM, EAROM, and RAM. All devices
have on-chip voltage and temperature compensation. This assures
constant logic levels and constant threshold, hence constant noise
margins. It also assures constant propagation delay over the
entire operating voltage and temperature ranges. Two types of
power supply are specified. Logic circuits use -2.7V ± 30% or
-4.8V ± 25%, while memories use only -4.8V ± 25%. All devices
have on-chip output resistors which serve to source-terminate 50
ohm transmission lines. On-chip Test and Diagnostic (T&D)
monitors are used to detect opens and/or shorts of any logic net
and loss of power supply voltages to any circuit chip.
The salient features of the BCML family are given in Table 5.13.
5.12.2 Packaging
5.12.2.1 General
Final choices of packaging technology can be deferred until the
system design is nearly complete. However, for performance and
reliability analysis, scheduling and cost, preliminary selections
must be made. Basic high speed (ECL) packaging technology has
been developed over the past decade that provides high performance
and reliability at quite reasonable cost. The manufacturing
tools, and assembly and test procedures, are all fully developed.
This technology includes a family of specified and use-qualified
components and hardware. Advances in this area are under
continual study. The current status and performance
characteristics of this technology are discussed in the following
sections.
5.12.2.2 Printed Circuit Assemblies
Multi-layered printed circuit assemblies provide a straightforward
approach for the packaging of standard commercial dual-in-line ECL
circuits. The six layer 16 inches by 18.5 inches assembly used by
Burroughs on the PEPE and later programs provided a capability of
mounting 300 sixteen pin dual-in-line packages or 280 sixteen pin
and 10 twenty-four pin packages. The board consists of six copper
layers permitting two signal layers, two voltage layers and two
ground layers (Figure 5.16). Each signal layer references a
ground plane providing two layers of 50-ohm microstrip. Proper
tolerance is maintained over line width, dielectric spacing and
dielectric constant.
Table 5-13
FEATURES OF BURROUGHS CML CIRCUIT FAMILY
- High speed - 0.7 ns per raw gate
- Low power delay product - 4 pJ per internal gate for LSI, 6 pJ
for MSI and 8 pJ for SSI
- Fully compensated logic levels and threshold - Noise margins
and propagation delay remain constant over operating tempera-
ture and voltage ranges
- Source terminated interconnection - On-chip output resistors
properly terminate 50 ohm transmission lines
- Complementary, simultaneous outputs - Simplifies design,
minimizes cross-talk
- Small logic swing of 440 mV - Provides higher speed at lower
power, low noise generation
- Constant supply current - Reduces noise, fewer or smaller
decoupling capacitors, no dependence on operating frequency
- Advanced Circuit Technique - On-chip use of series-gating, gate
stacking, EFL (Emitter-Function-Logic), Schottky-Diode gating,
wired-OR and -AND, staggered thresholds, etc., provide best
functional density at lowest power level
- Test & Diagnostic Pin (T&D) - Facilitate testing of individual
packages and isolate faulty packages in operating environment
- 50 pad package - Increases logic function capability per device
and reduces package count
- Multi-chip package - Increases packaging density, improves
performance, and reduces system cost
- SSI to LSI densities - 50 pad package has capacity to accommo-
date gate densities in the order of 1000 gates per chip
Figure 5.16 Multilayered Printed Circuit Board for ECL
Figure 5.17 illustrates the component side of a fully populated
board assembly of the PEPE type. The aluminum electrolytic
capacitors along the top and bottom edges of the assembly are
utilized to bypass voltage noise for frequencies below 1 MHz. Two
other levels of bypassing control voltage noise: the interlayer
capacity of the board for frequencies above 20 MHz, and ceramic
capacitors (contained in the terminator resistor packages) for
frequencies between 1 and 20 MHz. The board assembly is mounted
in a diecast aluminum alloy frame. Camming type handles are
mounted on the front of the frame to provide the insertion force
to mate the four 100-pin I/O connectors. The I/O connectors
incorporate a unique socket design that results in low insertion
force and low contact resistance. A single 100-pin connector
nominally requires around a 13-pound insertion force. Four
connectors would result in an insertion force of approximately 52
pounds. The handles are also used to lock the board in place.
Each circuit card module assembly is supported by shear/locating
pins in front and rear.
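The three bypass levels and the insertion-force arithmetic described above can be summarized in a short sketch. It is illustrative only: the function names are hypothetical, and the assignment of the exact boundary frequencies (1 MHz and 20 MHz) to the ceramic capacitors is an assumption, since the text gives only the ranges.

```python
CONNECTOR_FORCE_LB = 13  # nominal insertion force per 100-pin connector


def bypass_level(frequency_hz: float) -> str:
    """Name the bypass element that handles supply noise at this frequency."""
    if frequency_hz < 1e6:
        return "aluminum electrolytic capacitors"
    if frequency_hz <= 20e6:  # boundary handling is an assumption
        return "ceramic capacitors in the terminator packages"
    return "interlayer capacitance of the board"


def insertion_force_lb(connectors: int) -> int:
    """Total nominal insertion force for a board with this many connectors."""
    return connectors * CONNECTOR_FORCE_LB


for f in (100e3, 5e6, 50e6):
    print(f"{f / 1e6:6.1f} MHz -> {bypass_level(f)}")
print("four connectors:", insertion_force_lb(4), "pounds")
```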
This assembly can accommodate cam action zero insertion force
connectors which in turn can accommodate the edge connector of
belted cable paddleboard assemblies.
The assembly may be adapted to mount dual-in-line sockets. Each
socket is soldered to the board to pick up the printed circuit
signal trace. In addition, wire-wrap tail on the socket provides
for two levels of wire-wrap. (Figure 5.18).
Figure 5.17 Component Side of Fully Populated Printed Circuit
Board Assembly
(Callouts: test module; die-cut aluminum frame; 24-pin dual-in-line
integrated circuit package; 16-pin dual-in-line integrated circuit
package; 16-pin dual-in-line terminator module; keying pins;
shear/locating pin; camming handle; decoupling capacitors; I/O
connector.)
Figure 5.18 Multilayered Printed Circuit Assembly with Dual
In-Line Devices and Sockets
5.12.2.3 Interconnections
Two primary techniques for the interconnection of the basic assem-
blies (processors, memory modules, etc.) help guarantee feasibi-
lity of the FMP. Wherever possible, interconnections will be made
with paddle board and belted cable assemblies. Belted transmis-
sion cable with up to 70 conductors, (AWG 28 or 30, silverplated)
on 0.025 inch centers suitable for the FMP signal levels and
frequencies is readily available. Techniques for semi-automatic
assembly of these cables to paddleboards with edge connectors are
fully developed and provide economical, reliable
interconnections.
Where the use of belted transmission line is impractical, inter-
connections are achieved with subminiature 50-ohm coaxial wire.
The coax consists of No. 32 AWG, silverplated-drain or ground
conductor; a wrapped tape shield of aluminized mylar; and an outer
jacket of laminated mylar. The maximum overall size of the cable
is 0.033 inch x 0.043 inch. The drain conductor is compressed
between the aluminum side of the shield and the primary insulation
such that the drain wire is in contact with the shield along the
full length of the cable. Both conductors (ground and signal) are
wrapped simultaneously on adjacent pins (on 0.100 inch centers)
using a dual-bit wire-wrap gun as shown in Figure 5.19.
Figure 5.19 Backplane with Subminiature Coaxial Wire
(Callouts: subminiature 50 ohm coax; power distribution pins;
dual-bit wire-wrap gun.)
5.12.2.4 Backplanes
Backplanes for power distribution are not required for the proces-
sors as they have individual power supplies. However, in the case
of the coordinator and connection network it may be more desirable
to have a centralized power source which for high speed ECL techno-
logy would normally require a laminated backplane assembly.
This assembly consists of three layers of epoxy-coated copper. It
serves the dual functions of: 1) mounting the female half of the
circuit card module assembly connectors, and 2) efficiently distri-
buting power to each circuit card module assembly by providing a
low impedance power distribution network.
Power is distributed to each circuit card module assembly via pins
soldered to the individual backplane layers as shown in Figure
5.19. A wire wrap connection is then implemented between the
backplane and associated connector pins. Multilayer, laminated
backplanes are required to minimize backplane impedance (primarily
inductive). A low inductance offers a low impedance to surge
currents, guarantees power supply stability, and gives fast power
supply response time.
5.12.2.5 Cabinet Frame Assembly and Doors
At this time, it is anticipated that the FMP equipment would be
housed in cabinets similar to those used on other advanced proces-
sor systems currently being made by Burroughs. A description of
these assemblies is provided in the following.
The cabinet frame is constructed of 0.120-inch-thick rectangular
steel tubing welded into a unitized frame. In certain areas the
rectangular steel tubing is increased in thickness to 0.180 inch
for strength considerations. The overall dimensions of the basic
weldment are typically 81 inches high by up to 72 inches wide by
at least 30 inches deep. Maximum envelope dimensions of the
cabinet assembly, including all doors and end panels, are 81 inches
high by 100 inches wide by 32 inches deep.
Bi-fold doors are utilized on the front and rear faces of the
cabinet. Each bi-fold assembly (there are four) is composed of
two 0.75-inch-thick aluminum honeycomb panels connected by a
unique, extruded, continuous hinge. The stationary panel on the
right-hand end of the cabinet is constructed of 0.062-inch-thick
formed aluminum. A hinged split door configuration is utilized on
the end of the cabinet to provide access to the rear. The overall
thickness of the split door is 2.13 inches. Each door section is
composed of 1-inch-thick aluminum honeycomb and 0.062-inch-thick
formed aluminum.
5.12.2.6 BCML Packaging
5.12.2.6.1 General. A complete family of packaging hardware has
been specifically developed for the Burroughs CML circuit family.
This advanced hardware family incorporates features to accommodate
subnanosecond high density circuits of greater than 1000 gates
each for use in commercial state of the art computer systems. The
family includes low cost modular liquid cooling and power distri-
bution systems. The design concepts placed high consideration on
manufacturability and ease of assembly, debugging and maintenance.
The BCML packaging system provides hardware that can be used
across the Burroughs product lines of computer systems and for
other special applications. The basic philosophy of this pack-
aging system was to partition the second level packaging to be
compatible with functional logic partitioning. By packaging a
system function within an integral unit, the number of I/O's
between units is minimized and critical functions can usually be
restricted within this unit.
5.12.2.6.2 Circuit Packaging. The basic partitioning size
selected for the BCML system is a printed wiring board 14" x 21".
This unit is referred to as an island and can accommodate 10,000
logic gates with the current normal mix of SSI, MSI, and LSI BCML
parts. With the increased usage of LSI and VLSI circuits, island
gate capacity will be enhanced.
Another basic goal of the BCML packaging system is to provide for
ease of field maintainability. The following are some of the
packaging as well as circuit features that facilitate service-
ability:
1. Plug-in logic packages.
2. A probe system to allow simultaneous contact of all logic
package pins.
3. Provision for in-place testing of circuits.
4. No external components in wiring nets.
5. Test and Diagnostic pin (T&D) incorporated on logic
packages.
The first level of packaging was selected to accommodate a circuit
family aimed at high gate densities. Two package sizes, 25 pins
and 51 pins, are utilized. Multi-chip versions of the 51 pin
package can accommodate up to 3 I.C. chips.
The packages themselves are a leadless hermetic ceramic construc-
tion. The package has gold plated contacts on 50 mil centers in
two rows on its edges. The package also has an integral metal
heat sink plate. This member conducts heat generated by the
circuits to a liquid cooled frame and also serves as a low induc-
tance ground connection.
Two 25 pad packages or one 51 pad package mate with a 50 pin
connector. This connector will also accept two 24 signal I/O
cables or one 50 signal I/O cable. The interfaces of all the
pressure contact systems are gold plated for high reliability.
Two types of connectors are available; the first type is soldered
to the interconnecting printed circuit board and has a wire
wrappable tail while the second makes a pressure contact to a gold
pad on the printed circuit board. These two styles of connectors
provide flexibility in design of the island interconnect media.
There are 108 connectors mounted on the logic island as well as a
liquid cooled frame. The cold frame also serves as a low resis-
tance ground return path. Interconnection of circuits on an
island is accomplished by a combination of P.C. lines and open
wire. A multi-layer board with internal voltage and ground planes
and two external signal layers with 50 ohm lines is used for the
bulk of the interconnections. The shorter lines can be imple-
mented by automatic Gardner-Denver wiring with no performance
penalty. An all wired utility board system utilizing controlled
impedance twin lead and open wire is available for prototype and
limited production systems. Higher density and lower cost P.C.
interconnection systems are being developed for both the wrap
tail and double contact connectors.
Islands are interconnected with a high quality 50 ohm transmission
belt (24 or 50 signals). Since a cable interfaces with the same
socket as a logic package, the ratio of I/O pins to logic posi-
tions is not fixed by the hardware, but is established by the
logic design. This flexibility provides for efficient island
utilization. Figure 5.20 shows an island assembly mounted in a
module with belted cable interconnections.
5.12.2.6.3 Frame, Cooling & Power. In addition to a standard
logic family, island, and interconnecting belts, the BCML pack-
aging system also provides a mounting structure, cooling system
and power system for a 10 island module. This 10 island module
can be used individually for smaller systems or can be stacked 2
and 3 high for larger systems. The 50 ohm belted cables provide
module to module interconnections.
The module assembly enables the islands to fold out permitting
front and rear access thus facilitating testing and maintenance.
This feature is illustrated in figure 5.20.
The cooling system, which can dissipate a 3.6 KW heat load, con-
sists of cold frames mounted on the islands, a circulating pump,
fans, and a liquid to air heat exchanger. Air for the cooling
system is drawn from the computer room. For highly reliable
operation junction temperatures are restricted to 80°C with a 40°C
ambient. Much higher power (or lower junction temperature) could
be obtained by using a liquid to liquid heat exchanger with a
chilled coolant circulated through the island. This system does
not require air circulation in the computer room, with heat being
dissipated directly to the building chilled water supply.
The BCML power system is designed to be driven by an M-G set or an
equivalent line isolator. Large systems may be operated from a
site M-G set but a 20 KVA M-G set has been packaged in a sound
proof cabinet for installation in the computer room for use with
small to medium systems.
The power supply itself is a very simple and reliable design,
consisting of only a transformer and rectifiers. Output is -2.7V
± 30% and -4.8V ± 25%. Final regulation is provided by circuitry
on the logic chip. This on-chip regulation produces a constant
current load. Therefore voltage decoupling capacitors are not
required on the P.C. board.
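To show how wide a range the on-chip regulation must absorb, the following sketch converts the stated tolerances into voltage limits. The function name is hypothetical and the limits are simply the nominal value scaled by the tolerance; the report does not tabulate them.

```python
def supply_window(nominal_v: float, tolerance: float) -> tuple:
    """Return (smallest-magnitude, largest-magnitude) supply voltage limits."""
    return (nominal_v * (1.0 - tolerance), nominal_v * (1.0 + tolerance))


# Unregulated outputs of the transformer-rectifier supply described above.
for nominal, tol in ((-2.7, 0.30), (-4.8, 0.25)):
    low, high = supply_window(nominal, tol)
    print(f"{nominal:+.1f} V at {tol:.0%}: {low:+.2f} V to {high:+.2f} V")
```

Under these assumptions the logic supply may lie anywhere between about -1.89 V and -3.51 V, and the memory supply between -3.6 V and -6.0 V.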
5.13 IMPLEMENTATION TOOLS
Burroughs Corporation has a Central Design Assistance (CDA) Depart-
ment which is charged with the responsibility of developing and
maintaining a comprehensive set of tools to aid in the design,
manufacture, and maintenance of computer systems. These tools are
then adapted as needed and used by the various design and manu-
facturing groups.
The design of a complex system such as the FMP requires the use
of such tools. The Design Assistance System (DAS) and the
Burroughs Interactive Logic Design (BUILD) program are examples of
aids used during design. Specifically, the DAS programs provide
assistance in the development of manufacturing tooling from a
detailed logic design. The areas supported are:
- Logic partitioning
- Component placement
- Printed circuit routing
- Wire wrap routing
- Logic simulation
- Test generation
- Logic schematic generation
- Rules check
- Numeric control generation
In addition, design data is maintained in a centralized data base
to insure design integrity.
The Burroughs Interactive Logic Design (BUILD) program allows a
design engineer to hierarchically specify a logic design, and to
verify its correctness using functional simulation techniques.
After logic verification, netlists are generated from the logic
specification and entered into the DAS engineering data base for
physical implementation.
Figure 5.21 depicts these two systems as they would be used by the
NASF project.
Figure 5.21 NASF Hardware Design and Implementation Support System
(Block labels: BUILD input processor; schematics; logic simulation;
logic design; physical design; partitioning; placement; design
check; test generation; design assistance system database; printed
circuit router; wire-wrap router; test and field documents; element,
circuit, symbol, chip, device model and card description rules.)
CHAPTER 6
TRUSTWORTHINESS AND AVAILABILITY
6.1 TRUSTWORTHINESS, AVAILABILITY, AND ERROR CONTROL
6.1.1 General Requirements
As the introduction to Chapter 5 has already emphasized, the FMP
has certain requirements for trustworthiness, availability, and
error control. Among these basic requirements are:
- System availability of 90% or better, implying an FMP
availability of approximately 95% or better, for 20 hours a
day.
- Mean time between aborts visible to the user of over 10
hours.
- Probability of apparently successful but wrong runs much
lower than the probability of an abort.
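The requirements above translate into concrete budgets. The sketch below is a hedged illustration: the function names are hypothetical, the 15-minute run length is the "typical" run used elsewhere in this chapter, and the exponential (memoryless) abort model is an assumption made here, not a statement from the study.

```python
import math


def allowed_downtime_hours(availability: float, window_hours: float) -> float:
    """Downtime permitted within a scheduled window at a given availability."""
    return (1.0 - availability) * window_hours


def abort_probability(run_hours: float, mean_hours_between_aborts: float) -> float:
    """P(at least one abort during a run), assuming exponentially distributed aborts."""
    return 1.0 - math.exp(-run_hours / mean_hours_between_aborts)


print("system downtime budget:", round(allowed_downtime_hours(0.90, 20.0), 2), "h/day")
print("FMP downtime budget:   ", round(allowed_downtime_hours(0.95, 20.0), 2), "h/day")
print("P(abort in 15-min run):", round(abort_probability(0.25, 10.0), 4))
```

With a 10-hour mean time between aborts, roughly 2.5% of 15-minute runs would abort under this model, so the third requirement implies that wrong-but-apparently-successful runs must be much rarer than that.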
In order to satisfy the above requirements, a number of features
are built into the design, including:
- Spare processors and extended memory (EM) modules, with
software-controlled reconfiguration
- Duplexed operation with comparison of results
- Error detection and error correction on all memories
- "Scrubbing" through CCD memory and dynamic RAM memory to
find and correct any spontaneously occurring errors within
them
- Fault detection within logic circuitry (processor,
coordinator, etc.)
- Software-controlled restart following a program abort
- Logging of all errors, analysis of the logs
- Testing of invariants in the computation
- The ability to observe externally the state of the FMP
- A system of diagnostics and confidence checks
- Error detection in file system, both storage and transfer
paths
These features are implemented by a combination of hardware and
software.
The trustworthiness of computation on the NASF is the combined
result of a series of influences, including
- System software
- Hardware reliability
- Hardware error detection
- Completeness of the confidence and diagnostic checks
- Applications programming characteristics
- Accuracy of failure identification
- Thoroughness of checks for software errors
6.1.2 Design Requirements
Additional characteristics can be derived from the basic require-
ments of the previous section. These characteristics were derived
in Reference 5, and can be summarized as follows:
- Less than 1 bit in 10^17 in undetected error from processor
memory
- Less than 1 bit in 10^15 with error detected but uncorrect-
able from processor memory
- Less than 1 bit in 10^15 in undetected error from EM
- Less than 1 bit in 10^13 with error detected but uncorrect-
able from EM
- Less than 1 bit in 10^23 bits refreshed in DBM shall have an
undetected error
The derivations were based on observations on how many bits were
accessed from memory and from extended memory during the typical
15-minute run, and on an assumed time of residency in DBM that
might be as long as a day.
6.1.3 Sparing and Duplex Processing
Every processor cabinet has 129 processor slots; every EM cabinet
has 132 EM module slots. In the coordinator, there are four reg-
isters, one per processor cabinet, that designate the spare pro-
cessor in that cabinet. Spare EM modules are designated by regi-
sters in the CN buffer of every processor. The coordinator broad-
casts the designation of the spare to the CN buffer, using the BDCST
instruction, and follows that with a FILLR command to load these
registers. Thus the designation of which modules are spare is
changed by software resident on the coordinator.
Duplex processing has been proposed as a means of providing
dynamic, run-time checking of processors by comparing the results
of the same set of computations performed in two different
processors. Two approaches were considered and are discussed
below.
The spare processor designation is used to provide a duplex mode
of operation. First, one must make sure that there are 516 good
processors in the FMP. Second, processor #128 is designated
"spare" in each cabinet. This makes programmatic processor
numbers 0 through 127 fall on physical location numbers 0 through
127. Third, the program is run. Fourth, processor #0 is
designated "spare" in each cabinet. This makes programmatic
processor numbers 0 through 127 fall on physical processors 1
through 128, so that every computation in the run will fall into a
different processor than the first time. Fifth, the program is
run again. Sixth, the results of the second run are compared for
the expected match with the results of the first run.
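The renumbering used by this software duplex mode can be sketched as below. The function name and the skip-over rule for a spare in the middle of the cabinet are assumptions; the text gives only the two end cases (spare in slot 128 and spare in slot 0).

```python
SLOTS_PER_CABINET = 129  # processor slots per cabinet, one of which is spare


def physical_slot(programmatic: int, spare: int) -> int:
    """Map a programmatic processor number (0..127) to a physical slot.

    The slot designated as spare is skipped. The skip-over behavior for
    spares other than slot 0 or slot 128 is an assumption of this sketch.
    """
    if not 0 <= programmatic < SLOTS_PER_CABINET - 1:
        raise ValueError("programmatic number must be 0..127")
    if not 0 <= spare < SLOTS_PER_CABINET:
        raise ValueError("spare slot must be 0..128")
    return programmatic if programmatic < spare else programmatic + 1


first_pass = [physical_slot(p, spare=128) for p in range(128)]
second_pass = [physical_slot(p, spare=0) for p in range(128)]

# Every computation lands on a different physical processor in the second run.
moved = all(a != b for a, b in zip(first_pass, second_pass))
print("first pass slots: ", first_pass[0], "...", first_pass[-1])
print("second pass slots:", second_pass[0], "...", second_pass[-1])
print("all computations moved:", moved)
```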
Another form of duplexed processing was considered during the
course of the study. Here the duplex mode would be implemented
through some additional hardware. The 512 processors would be
divided into 256 sets of 2 each. The application program would be
compiled and run as if only 256 processors were available. Each
set of 2 processors would execute the same code on the same data,
and the results of each would be compared. The operation of the
CN is such that continuous synchronization between the two members
of the pair would require hardware means beyond those described
in Chapter 5, such as making both processors use the CN buffer of
one of them. A hardware comparator would monitor the performance,
and errors would be detected as soon as the outputs of the two
processors to the CN or to the coordinator fail to match. Because
of the synchronization problems, and because there seems to be no
real advantage of this scheme over the purely software duplexed
computation described first, the hardware comparator has not been
included in the design.
6.1.4 Error Correction in Memories
All memory has error detection and error correction in order to
achieve the very low error rates of the requirements. Error
detection is a necessary part of hard failure detection. Error
correction is proposed based on expected memory error rates
between 1 bit in 103 and 1 bit in i0 .
For processor memory and extended memory, a SECDED (single error
correction, double error detection) code is proposed. The actual
error rates in the chips would have to be very good indeed before
simple parity plus retry would provide adequate correction. The
actual error rates would have to be very bad (worse than 1 bit in
10^8) before simple SECDED was not good enough.
For Data Base Memory (DBM), a higher intrinsic error rate is
expected from the chips, since the geometries on the chips are
smaller, and since more refreshes occur per access. Also, a
higher standard of performance is required, since any given datum
will go through many read-write restorations during the lifetime
of the data in DBM. As the computations in Ref. 5 show, we expect
that the same simple SECDED will also be adequate error correction
in DBM. However, the safety factor is substantially less, and a
reevaluation of this choice should be made when the soft failure
rate of the 256K CCD chips becomes known.
In the DBM it is also necessary to periodically read each word,
make necessary corrections if possible, and write it back in, in
order to keep the probability of multiple errors low enough. This
process is called "scrubbing" and is expected to be designed into
any memory system requiring it. Therefore, the DBM will not
require any external controls for scrubbing. The errors removed
by "scrubbing" are called "soft errors". This term, soft errors,
implies failures where the contents of the storage cell have been
modified in some unexpected or unplanned way (such as by the
effect of background radiation), but which are not the permanent
inability of a storage cell to operate correctly. The following
paragraphs discuss the SECDED scheme proposed and also discuss the
scrubbing of errors out of DBM.
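The reason scrubbing keeps multiple errors rare can be illustrated with a small calculation. This is a sketch under stated assumptions: bit errors are taken as independent, the per-bit error rate is a hypothetical number rather than a figure from this study, the 55-bit word length is the code block length cited for the proposed SECDED code, and the formula is the small-probability approximation C(n,2)p^2.

```python
from math import comb


def p_double_error(bit_errors_per_hour: float,
                   scrub_interval_hours: float,
                   word_bits: int = 55) -> float:
    """Approximate chance that one word collects two soft errors between scrubs.

    Uses the small-p approximation C(n, 2) * p**2, where p is the chance
    that a given bit is upset during one scrub interval.
    """
    p = bit_errors_per_hour * scrub_interval_hours
    return comb(word_bits, 2) * p ** 2


RATE = 1e-9  # hypothetical soft errors per bit per hour

hourly = p_double_error(RATE, 1.0)
half_hourly = p_double_error(RATE, 0.5)
print(f"scrub every 60 min: {hourly:.3e} per word per interval")
print(f"scrub every 30 min: {half_hourly:.3e} per word per interval")
print(f"ratio: {hourly / half_hourly:.1f}")
```

Because the probability grows with the square of the interval, halving the scrub interval cuts the per-interval double-error probability by a factor of four, or by a factor of two per unit time.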
6.1.4.1 SECDED
For soft failures, the previous studies (5) show that the improve-
ment factor due to error correction is essentially infinite; that
is, the system would be unable to produce useful results without
error correction at the presumed soft error rates. For hard
failures, the improvement factor due to the use of error correc-
tion depends on the failure modes. Some failures, such as an
address decoding failure external to the memory chip that causes
multiple bit errors, are not helped by error correction. A fail-
ure internal to a single memory chip is helped by error correc-
tion. In addition, the error correction circuits have failures
that would not occur if there were no error correction. The
analysis following in Section 6.3.3 recognizes the other effects
contributing to undetected errors. That section uses a very
conservative improvement factor of 5 in the number of observed
errors when using SECDED for correcting hard failures vs the
situation where no SECDED is used. The following discussion
addresses these statements in more detail.
First consider the case of soft failures as represented by read
failures. About 5 x 10^12 operands are used or produced during the
course of the "typical" 15 minute run (5). If half of these come
from processor memory, that means almost 2 x 10^14 bits are fetched
from processor memory during the course of a typical run.
Although accurate projections of bit error rates for large semicon-
ductor memory chips await more experience, it is plausible that
bit error rates may lie between one bit in 10^10 and one in 10^14
bits read. Under the above conditions, without error correction,
soft errors make it unlikely that the typical run can even
complete successfully.
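The claim can be checked numerically. The sketch below assumes soft errors are independent so that the number of errors in a run is approximately Poisson distributed; that model and the function names are assumptions of this illustration, while the bit count and the range of error rates are the ones quoted above.

```python
import math

BITS_FETCHED = 2e14  # bits fetched from processor memory in a typical run


def expected_errors(bits: float, one_error_in: float) -> float:
    """Mean number of soft errors when one bit in `one_error_in` is wrong."""
    return bits / one_error_in


def p_clean_run(bits: float, one_error_in: float) -> float:
    """Chance of a run with no soft error at all (Poisson assumption)."""
    return math.exp(-expected_errors(bits, one_error_in))


for exponent in (10, 12, 14):
    rate = 10.0 ** exponent
    print(f"1 in 10^{exponent}: "
          f"{expected_errors(BITS_FETCHED, rate):10.0f} errors expected, "
          f"P(no error) = {p_clean_run(BITS_FETCHED, rate):.3f}")
```

Even at the optimistic end of the range about two uncorrected errors would be expected per run, so only about one run in seven would see none.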
For hard failures, however, the picture is different. If one
memory chip output is stuck in one processor, only 1/512 of the
words accessed are affected by that failure. That processor's memory
delivers 4 x 10^11 bits during the course of the run. If one bit
in every word in that one processor is bad, and if the soft error
rate is 1 in 10^12 or better, the run will probably complete
successfully. A failure at a specific bit in one chip is even
less likely to cause trouble.
Since double errors or worse are not corrected automatically with
the proposed SECDED code, it is important to use preventative
measures. When the SECDED logic corrects a failure, a log will be
updated indicating the word and bit position in the memory which
was corrected. These logs will be examined regularly in order to
detect and replace failed parts before they cause an abort.
The error correcting code of Table 6.1 appears to be the best
choice for the FMP. First, it is directly implementable by the
Motorola SECDED parity generator chip (each 8-bit wide slice of
the code exhibits exactly the same pattern of parity checks as
found in that chip). Second, it is much better than a randomly
selected SECDED at detecting triple errors. Even the optimum
SECDED is not very good at detecting triple errors when there are
55 bits used out of the underlying 64-bit long Hamming plus-parity
code block. This proposed code is almost as good as that optimum.
Each "x" in Table 6.1 means that that bit is included in the
parity check represented by its corresponding checkbit. The seven
check bits are the 6 bits of the Hamming code, plus a bit that
allows an overall parity check. For improved performance against
multiple errors, the 7th bit contains an "x" only for those bit
positions which enter into 0, 2, or 4 of the other check bits.
Actual overall parity is the parity of all seven check bits. Odd
parity is used.
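The rule described above is enough to regenerate the check patterns of Table 6.1. The sketch below assumes that the k-th Hamming check covers every bit number whose k-th binary digit is one, which is the standard Hamming arrangement and agrees with the rows of the table; the function names are illustrative and do not come from the report.

```python
N_BITS = 55  # bit numbers 0..54 of the 64-bit Hamming-plus-parity block are used


def hamming_rows(n_bits=N_BITS):
    """Rows for the 1st..6th check bits: bit b is covered by check k if binary digit k of b is 1."""
    return ["".join("x" if (b >> k) & 1 else "." for b in range(n_bits))
            for k in range(6)]


def parity_row(n_bits=N_BITS):
    """7th row: an 'x' only where the bit enters an even number (0, 2 or 4) of the other checks."""
    return "".join("x" if bin(b).count("1") % 2 == 0 else "." for b in range(n_bits))


def checks_flipped_by(bit):
    """The seven checks (as 0/1 flags) that fail when exactly this one bit is in error."""
    flags = [(bit >> k) & 1 for k in range(6)]
    flags.append(1 if sum(flags) % 2 == 0 else 0)
    return flags


if __name__ == "__main__":
    for row in hamming_rows() + [parity_row()]:
        print(row)
```

With this construction every bit position enters an odd number of the seven checks, so any single error flips an odd number of check results, while any double error flips an even number. That is what allows the parity of all seven check bits to serve as the overall parity.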
The bit number in Table 6.1 is not the bit number of the data
word. For one thing, the check bits are interspersed. The corres-
pondence of bit number as seen by the programmer to the bit number
of Table 6.1 is arbitrary. This mapping will be left as a logic
designer's option.
Triple errors appear to be single errors to the proposed code.
Some triple errors will be detected when the SECDED circuits
detect a failure and attempt to correct a bit outside the 55-bit
field shown in the table (this is possible since the code chosen
is a portion of an underlying 64-bit long Hamming plus-parity code
block). The code shown in Table 6.1 detects 14.6% of all triple
errors, whereas a randomly selected SECDED would be expected to
detect 14.1% (nine bit locations out of 64 are outside the 55-bit
word).
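The two percentages quoted above can be reproduced by enumeration. The sketch below assumes that the position a triple error appears to indicate is the exclusive-OR of the three bit numbers in error (the behaviour of a Hamming syndrome), and that the error is detected exactly when that position falls outside bit numbers 0 through 54. The function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations


def triple_error_detection_fraction(n_used=55):
    """Fraction of triple errors whose apparent single-error position lies outside 0..n_used-1."""
    detected = 0
    total = 0
    for a, b, c in combinations(range(n_used), 3):
        total += 1
        if (a ^ b ^ c) >= n_used:  # decoder would try to correct a nonexistent bit
            detected += 1
    return Fraction(detected, total)


if __name__ == "__main__":
    frac = triple_error_detection_fraction()
    print(f"proposed code:   {float(frac) * 100:.1f}% of triple errors detected")
    print(f"random estimate: {9 / 64 * 100:.1f}% (9 unused positions out of 64)")
```

Under these assumptions the enumeration covers all 26,235 possible triples and gives a detection rate that rounds to the 14.6% quoted in the text; the 14.1% figure for a randomly selected code is simply 9/64.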
TABLE 6.1
Error Correcting Code

Bit number *      0000000000111111111122222222223333333333444444444455555
                  0123456789012345678901234567890123456789012345678901234
Check bits        XXX X   X       X               X
Check bit parity patterns
  1st             .x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.x.
  2nd             ..xx..xx..xx..xx..xx..xx..xx..xx..xx..xx..xx..xx..xx..x
  3rd             ....xxxx....xxxx....xxxx....xxxx....xxxx....xxxx....xxx
  4th             ........xxxxxxxx........xxxxxxxx........xxxxxxxx.......
  5th             ................xxxxxxxxxxxxxxxx................xxxxxxx
  6th             ................................xxxxxxxxxxxxxxxxxxxxxxx
  Parity          x..x.xx..xx.x..x.xx.x..xx..x.xx..xx.x..xx..x.xx.x..x.xx

* The assignment of bit number (corresponding to Hamming's) may be
different than the assignment to be found in the register to which
this parity check is attached. The bit number found here is the
one generated as an indication of the bit to be corrected in the
error correcting code.
Codes which are useful at detecting triple errors are also of
interest. One additional check bit allows a code in which triple
errors are almost always detected (better than 90% of the time).
The price for this improved error detection capability is a
connection network (CN) one bit wider or extended memory (EM)
access time 20 ns longer, more complex parity checking, more
complex decoding of the bit in error and 2% more memory. Current
estimates of memory chip bit error rates imply that this addition-
al complexity is not warranted.
SECDED checking and generating logic is found in the following
locations:
- Processors, where the processor generates check bits for
all memories it accesses (both PM and EM via the CN buffer),
and checks words fetched from the PM or received via the CN
buffer.
- Coordinator, where the function is parallel to that in the
processors.
- DBM, in the channels to and from the file system.
SECDED logic is not needed in the EM modules, since all EM data
will have check bits when stored, and will have their codes check-
ed at some point after being fetched from EM, usually upon being
read from a CN buffer in a processor.
In addition to the SECDED on all memory data, there are some
simple parity checks. The address-plus-instruction-code sent
through the CN for controlling EM buffers has parity checked at
EM. The contents of microprogram memory in the processor have a
parity bit.
The responses to SECDED and parity errors are as follows:
1. EM module detects parity error on module-number/address/
op-code field sent from processor. The EM module does not return
an Acknowledge on bad parity, so the processor will continue to
send the same request. If the error was a transient, proper
operation will resume. If the error was a hard error, the proces-
sor will hang on trying this request, eventually causing the
coordinator to have a time-out interrupt. The EM module sends an
"address parity bad" interrupt to the coordinator. This would
normally be masked off to allow useful processing to continue in
those cases where the retry works.
2. Processor corrects single error. An interrupt to processor-
resident software results in the logging of the action in a table
in processor memory.
3. Processor detects double error in word received from EM. The
processor halts with interrupt, and the program is discontinued.
Software can restart the program from some prior point, possibly
after system reconfiguration.
4. In all of the above, the requestor may have been the CN buffer
of the coordinator used by the coordinator for accessing EM. In
these cases, read "coordinator" where the previous sections
say "processor".
6.1.4.2 Scrubbing Errors out of CCD Memory and Dynamic RAM
In the case of CCD memories, errors are not confined to the
reading and writing process. Errors can also arise within the
memory chips. If data is stored in a particular location with no
reference for a long time, such as hours or days, the probability
of errors may become intolerably high. It will be necessary,
therefore, to continually scan through the data base memory (DBM)
correcting all the single-bit errors in order to allow the
survival of the data base for a long enough period of time.
Depending on the magnitude of the soft-error problem, it may be
feasible to use a stronger error-correction code, and thus
eliminate the scrubbing. With scrubbing, the probability of
non-correctible errors grows linearly with time, the envelope of
pieces that individually have the form t^e, where e is the number of
errors in the uncorrectible case (Figure 6.1). With stronger
error correction, correcting f errors, the curve has the form t^f.
e=2 for Hamming plus parity; f can equal any number for a BCH*
code (7). Clearly, the "scrubbing" storage design has more lati-
tude against variations in error rate.
The critical aspect of DBM is the storage of restart files, up to
10^9 bits, for times that presumably could be days. The method of
error correction used will depend on the technology to be used for
the file.
To determine the optimum rate for scrubbing errors out of CCD
memory, we should know both the error rate for spontaneously
occurring errors, and the error rate for the reading and writing
process. For any given error correcting code, there will be an
optimum scrubbing frequency where the two sources of error are in
balance and are a minimum.
In Reference 5, the assumption was made that the CCD memory of DBM
would lose, on the average, 1 bit per 3 x 10^16 bits shifted. This
error rate was based on preliminary experience reported by Fairchild.
Since then, the cause of loss of bits has been identified
as background radiation, primarily due to alpha particles coming
from contaminants in the package. More recent quantitative data
is not available. New manufacturing techniques by the vendors
appear to be solving the problems. On the basis of the original
soft-failure rate data, a scrubbing rate of once every seven min-
utes will be enough to keep a 10^9 bit file error-free for one day
with probability 0.999.
*Bose-Chaudhuri-Hocquenghem
Figure 6.1 Scrubbing versus Read-Time Error Correction
[Plot of the probability of not recovering from an error against
t = time in storage. One curve, of the form t^f, is for a code that
corrects f-1 errors; the other, for scrubbing, is for a code that
corrects e-1 errors.]
Scrubbing in the DBM will make use of hardware and data paths
which would exist even if scrubbing were not necessary. In parti-
cular, the channels to and from the file system have buffers and
SECDED checkers and generators associated with them. Part of the
normal channel/interconnection path capabilities would be a loop-
back mode for diagnostics. All of these capabilities can be
utilized to implement scrubbing as needed. The DBM controller
will schedule blocks (probably 16K words) to the channel buffers
through the SECDED checker/generator and back to the CCD store.
The maximum transfer rate between the DBM and the file system is
expected to be 40 Mbits/sec. At this rate, the entire DBM can be
read in 3.5 minutes. Periods of high channel activity imply
lowered requirements on scrubbing due to natural activity within
the DBM. It is, therefore, reasonable to plan to use some of the
channel capabilities (buffers, SECDED, loop-back) to implement the
scrubbing functions. If DBM blocks are 16K words (a likely result
of CCD organizations), and if the scrub cycle needs to be seven
minutes then the scrub rate is one block every 51.4 msec.
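The scrub-rate arithmetic above can be run backwards to see what it implies. The sketch below uses only figures from this paragraph (a seven-minute scrub cycle, one 16K-word block every 51.4 msec, and a 40 Mbits/sec channel that reads the whole DBM in 3.5 minutes); the derived block count and capacity are inferences from those figures, not values stated here, and the names are illustrative.

```python
SCRUB_CYCLE_S = 7 * 60        # seven-minute scrub cycle (from the text)
BLOCK_INTERVAL_S = 51.4e-3    # one block scrubbed every 51.4 msec (from the text)
BLOCK_WORDS = 16 * 1024       # 16K-word blocks (from the text)
CHANNEL_BITS_PER_S = 40e6     # maximum DBM/file-system transfer rate (from the text)
FULL_READ_S = 3.5 * 60        # time to read the entire DBM at that rate (from the text)


def implied_block_count():
    """Number of blocks that fit in one scrub cycle at the stated block interval."""
    return SCRUB_CYCLE_S / BLOCK_INTERVAL_S


def implied_dbm_words():
    """DBM size in words implied by the block count (a derived estimate)."""
    return implied_block_count() * BLOCK_WORDS


def implied_dbm_bits():
    """DBM size in bits implied by the channel rate and the full-read time."""
    return CHANNEL_BITS_PER_S * FULL_READ_S


if __name__ == "__main__":
    print(f"blocks per scrub cycle: {implied_block_count():.0f}")
    print(f"implied DBM size:       {implied_dbm_words():.3g} words")
    print(f"implied DBM size:       {implied_dbm_bits():.3g} bits")
```

Since a full pass takes 3.5 minutes at the maximum channel rate, a seven-minute scrub cycle would occupy roughly half of the channel bandwidth when no other traffic is present.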
As the geometries of the individual cells of integrated circuits
shrink, other parts are expected to evidence soft-error problems
similar to those being experienced by CCD parts now. 256 Kbit
dynamic RAMs (which may be considered as a technological alterna-
tive to the 256 Kbit CCD's, depending on the design and implemen-
tation schedule) are expected to experience a soft-error rate
large enough to also require scrubbing. The parts currently
planned for the extended memory (EM) have large enough geometries
that the soft-error rate is very low. In addition, the EM does
not contain any long-term data. Hence, no scrubbing is necessary
or planned for the EM.
6.1.5 Error Detection and Correction in the Connection Network
The Connection Network (CN) is of central importance in the imple-
mentation of the proposed FMP. Since the design of intercon-
nection systems is generally not as well understood as that of
processors, and since there appears to be less redundancy, the
planned defenses against erroneous operation are described in some
detail below.
6.1.5.1 Magnitude of the Problem
As described later, single transient errors are self-correcting in
the use of the CN. The only faults that might cause problems are
hard failures (i.e. permanent failures). The discussion below
shows that hard failures can always be detected during execution
of user program (and therefore by implication detectable during
confidence tests). Section 6.1.10.3, which follows, shows that
these faults are diagnosable once the job in process has been
aborted.
As to the magnitude of the problem, the CN is built from 39,280
identical LSI circuits. If these circuits have the "normal"
failure rate of 0.1 failures per million hours, the expected MTBF
will be 254 hours, or 33 failures per year. During the entire 10
year design life of the FMP 330 failures are expected. With the
fault detection and isolation techniques outlined below, it is
very unlikely that one of these expected 330 failures will be
undetected.
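The failure-rate arithmetic above is easy to reproduce. The sketch below uses the circuit count and per-circuit failure rate from the text and assumes continuous operation (8,760 hours per year), which the report does not state; the function names are illustrative.

```python
N_CIRCUITS = 39_280            # identical LSI circuits in the CN (from the text)
FAILURES_PER_HOUR = 0.1e-6     # "normal" rate: 0.1 failures per million hours (from the text)
HOURS_PER_YEAR = 8760          # assumption: the FMP is powered around the clock


def cn_mtbf_hours(n=N_CIRCUITS, rate=FAILURES_PER_HOUR):
    """Mean time between failures of the whole CN, assuming independent circuit failures."""
    return 1.0 / (n * rate)


def cn_failures_per_year(hours_per_year=HOURS_PER_YEAR):
    """Expected number of CN failures per year of operation."""
    return hours_per_year / cn_mtbf_hours()


if __name__ == "__main__":
    print(f"CN MTBF:           {cn_mtbf_hours():.1f} hours")
    print(f"failures per year: {cn_failures_per_year():.1f}")
```

The MTBF agrees with the 254 hours quoted. With an 8,760-hour year the yearly count comes out slightly above the report's 33; the difference is small and may reflect rounding or a different assumed operating schedule.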
6.1.5.2 Defense vs. Type of Fault
6.1.5.2.1. Single transient error in the request sent to EM. A
single transient failure in module number, address, or opcode
field causes the EM module to detect a parity error, which causes
the processor to retry the operation in question. System software
normally allows retries to proceed unmolested.
6.1.5.2.2. Single transient error in data. This is corrected by
the error correcting code and logged. Computation proceeds.
6.1.5.2.3. Hard failure on the path from one processor to EM.
This hard failure will either cause parity errors to be detected
by the EM or SECDED errors in any words stored. The analysis in
section 6.1.5.3 shows that over half of the addresses sent through
the fault are detected as errors. The result is that such an
error will be detected very quickly.
6.1.5.2.4. Hard failure on the data path from EM to processor.
Only data, with SECDED, flows over this path. The analysis of the
next section shows that over two-thirds of the faulty data words
are detected. Thus, such a failure is quickly detected, usually
on the first word transmitted after the failure occurs.
6.1.5.2.5. Hard failure in the path-selecting control logic.
Here, there are several cases to consider.
First, if the wrong path is selected, and if the wrongly selected
EM module has a different number of bits in its CN port number, a
parity error is detected at the EM module. Half the EM modules
will detect such a parity error, so that EM accessing will not go
on for long without the error being detected.
Second, the correct path is selected, but with wrong priority, so
that a particular processor is being discriminated against. The
program will continue to execute correctly, but execution time
will be lengthened for certain patterns of access conflicts in the
CN. We believe that analysis will show that such priority
failures are harmless for some programs, including aero flow
codes, but no simulations to verify this expectation have yet been
done. Such failures can be found by diagnostics. All processors
are sent to fetch from EM, execution is allowed to proceed for a
fixed time, and then it is observed that the processors with
correct results are not the expected set.
Third, the strobe line is falsely high. This will cause the CN
buffer to think that the EM module has granted access when in fact
it has not. If there are no CN delays, the correct word will come
back in spite of the fault. When there are delays, the CN buffer
will pull in "garbage" since no real word is coming back at the
time the false acknowledge says it is. Since the path from this
CN buffer, if blocked, is blocked for at least one CN clock, that
garbage is either all zeroes or all ones, for which the Hamming
error correction identifies bit 63 and bit 56 respectively as the
bit in error. Since there is no such bit, the error is immediate-
ly caught.
6.1.5.3 Analysis
As described previously in Chapter 5, the Connection Network is
designed to transmit a sequence of 11-bit frames. The main
purpose of this approach is to reduce the number of wires and the
complexity of the network itself. If the entire message is 33
bits long, then a stuck-at fault will change either 0, 1, 2, or 3
bits depending on whether those bits were the same value as the
bit produced by the stuck-at fault or not. A stuck-at-ONE fault
produces no errors when all the bits were ONE to start with. When
the entire message is 55 bits long, the stuck-at fault jams five
successive bits to the state at which the fault is stuck, produc-
ing 0, 1, 2, 3, 4, or 5 errors.
1. First consider the case that the module-number/address/opcode is
being passed to the EM (33 bits) and the bit of the EM module
number is the same as the value at which the fault is stuck. The
remaining two bits can have either 0, 1, or 2 errors. When the
remaining bits are address bits, it appears valid to assume that
they behave as random bits. Hence we have 25% of the time no
error, 50% of the time a single error that is detected by parity
failure, and 25% of the time a double error that is not detected.
Exactly two thirds of the errors are detected. After ten addres-
ses have been passed through this fault, the probability of the
error being detected is 99.9983%; after twenty, 99.99999997%.
2. Take the case as above, except the EM module number bit is
wrong. When parity is checked, at the wrong EM module this time,
there will be either 1, 2, or 3 errors in the 33-bit package,
again with probability 25%, 50% and 25%. The single and triple
errors result in parity errors and are detected. Thus, exactly
one half of the errors are detected. After ten addresses have
passed through this fault, the probability of the error being
detected is 99.9%; after twenty, 99.9999%, again on the random
assumption for addresses.
3. For the third case, data, the analysis is of the same kind,
but there are more cases. Hence, it is easier to present the
analysis in the form of a table for the cases that there are 0, 1,
2, 3, 4, or 5 errors. For each possible number of errors, Table
6.2 shows the percentage of time we expect to find such an error
(the binomial distribution) when it occurs, the percentage of time
that this hardware error causes no operational failure (for ex-
ample, a single error is corrected using the SECDED code), the per-
centage of time that this number of errors is detected, and the
percentage of time that a single data word can slip by in error.
14.6% of the triple errors will be detected, and 85.4% of them
will appear to be correctible single errors and therefore not
detected.
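For the two address cases (items 1 and 2), the quoted probabilities follow from a simple geometric argument. The sketch below assumes, as the text does, that address bits behave as independent random bits, and it takes the chance of escaping detection on each address to be one minus the detected fraction (one third in case 1, one half in case 2). The function name is illustrative.

```python
def prob_detected_after(n_addresses, detected_fraction):
    """Probability that a hard fault has been detected after n addresses pass through it,
    when each address independently escapes detection with probability 1 - detected_fraction."""
    return 1.0 - (1.0 - detected_fraction) ** n_addresses


if __name__ == "__main__":
    for label, frac in (("case 1 (two thirds detected)", 2 / 3),
                        ("case 2 (one half detected)", 1 / 2)):
        for n in (10, 20):
            print(f"{label}, n={n}: {100 * prob_detected_after(n, frac):.8f}%")
```

For case 2 this gives 1 - 2^-10 and 1 - 2^-20, and for case 1 it gives 1 - 3^-10 and 1 - 3^-20, which match the percentages quoted in the text.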
Table 6.2
Single 55-bit Data Word passing through CN with single hard fault

No. bit                 No Func.    Error       Error
errors     Occurrence   Failure     Detected    Undetected
  0          3.12%        3.12%
  1         15.62%       15.62%
  2         31.25%                   31.25%
  3         31.25%                    4.56%      26.69%
  4         15.62%                   15.62%
  5          3.12%                    0.46%       2.67%
TOTAL                    18.75%      51.89%      29.36%
From the table, we see that the ratio of detected failures to
undetected failures is 51.89/29.36. That is, 64% of all the func-
tional failures are detected. After ten words have passed through
this hardware fault, the probability that the fault has been
detected is 99.9999%, assuming random data.
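The totals of Table 6.2 can be rebuilt from the binomial distribution. The sketch below assumes that each of the five bits jammed by the fault is wrong with probability one half, that zero or one error causes no functional failure, that an even number of errors (two or four) is always detected, and that an odd number of three or more is detected 14.6% of the time. The treatment of five errors like three is inferred from the table entries; the names are illustrative.

```python
from math import comb

ODD_MULTIPLE_DETECT = 0.146   # fraction of triple errors detected (from the text)


def table_6_2_totals(n_bits=5):
    """Return (no functional failure, detected, undetected) as percentages of all words."""
    no_failure = detected = undetected = 0.0
    for k in range(n_bits + 1):
        occurrence = 100.0 * comb(n_bits, k) / 2 ** n_bits   # binomial, p = 1/2
        if k <= 1:
            no_failure += occurrence          # no error, or single error corrected
        elif k % 2 == 0:
            detected += occurrence            # even count: flagged as a double error
        else:
            detected += occurrence * ODD_MULTIPLE_DETECT
            undetected += occurrence * (1.0 - ODD_MULTIPLE_DETECT)
    return no_failure, detected, undetected


if __name__ == "__main__":
    ok, det, undet = table_6_2_totals()
    print(f"no failure {ok:.2f}%, detected {det:.2f}%, undetected {undet:.2f}%")
    print(f"share of functional failures detected: {100 * det / (det + undet):.0f}%")
```

Under these assumptions the totals agree with the TOTAL row of the table, and the detected share of functional failures rounds to the 64% quoted.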
6.1.5.6 Logical Checks
Miscellaneous logical checks can be considered. The design intent
of these checks is to localize the effects of some error in the
FMP. The following list of checks includes those also listed in
Appendix C in the list of interrupts, plus others.
- Parity checks on microprogram
- Memory bounds checks (optionally inserted by compiler)
- "Uninitialized" word fetched to instruction decoder or
floating point unit
- Illegal opcode
- Detection of unnormalized floating point operand (except
second half of double length floating point)
- Integer overflow or underflow
- Divide by integer zero
- Floating point overflow (either tested for or marked
"unrepresentable", a compile time option)
- Timeout
In addition, there is a bit in the interrupt register reserved for
any miscellaneous logic malfunction checks that will be built into
the hardware. Lock-up of the end-around carry chain is an example
of the sort of logic error whose occurrence would be reported in
this bit.
6.1.7 Restart
Previous analysis (5) shows that the optimum time between re-
start dumps is given by

    T_opt = (2 T_g T_r)^{1/2}

where T_g is the mean time between failures (intermittent or hard)
that cause an abort, and where T_r is the time spent taking one
restart dump and also the time required to load the restart point
and switch to user programming, assumed equal.
Typical runs are 10 minutes, and typical data bases are 15 x 10^6
words (5). Since 15 x 10^6 words are loaded into EM in 0.375 sec,
we have T_r not more than about 0.5 sec. Since T_g will be on the
order of 10 hours or longer, we have T_opt = (2 x 0.5 x 36000)^{1/2}
seconds, or just over 3 minutes. However, the amount of compu-
tation time saved by dividing a typical 10 minute run into three
restartable segments is estimated to be about 0.3% of all FMP
time (see footnote 1). Hence there is little point to providing
restart points within the typical 10-minute run.
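The numbers in this paragraph and in its footnote can be reproduced directly. The sketch below evaluates the optimum-interval formula with the values from the text (T_r = 0.5 sec, T_g = 10 hours) and then repeats the footnote's throughput arithmetic for a 600-second run; the function name and the constant names are illustrative.

```python
import math


def optimum_restart_interval(t_g, t_r):
    """Optimum time between restart dumps: T_opt = (2 * T_g * T_r) ** 0.5, in seconds."""
    return math.sqrt(2.0 * t_g * t_r)


T_R = 0.5            # seconds per restart dump or reload (from the text)
T_G = 10 * 3600      # mean time between aborting failures: 10 hours, in seconds
RUN_S = 600.0        # typical 10 minute run

t_opt = optimum_restart_interval(T_G, T_R)

# Footnote arithmetic: time lost = data movement + expected wasted computation.
lost_with_dumps = 1.5 + 1.5   # two dumps plus initial load, and an equal expected waste
lost_without_dumps = 0.5 + 4.5   # initial load only, with the expected waste tripled
throughput_with_dumps = 1.0 - lost_with_dumps / RUN_S
throughput_without_dumps = 1.0 - lost_without_dumps / RUN_S

if __name__ == "__main__":
    print(f"T_opt = {t_opt:.1f} s ({t_opt / 60:.2f} min)")
    print(f"throughput with two restart dumps: {100 * throughput_with_dumps:.1f}%")
    print(f"throughput with no restart dumps:  {100 * throughput_without_dumps:.1f}%")
```

The difference between the two throughput figures is the roughly 0.3% cited above as the saving from dividing the run into three restartable segments.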
Unless restart points are provided by the user, the restart
strategy will be to restart the same task again automatically
under software control. Automatic restart is limited to those
aborts that are probably caused by hardware error, such as parity
errors. Aborts that are likely to be software errors, such as
addressing errors, will not trigger automatic restarts. The
operating system handles automatic restarts, and reports their
occurrence.
The two types of restart dumps mentioned above should be deline-
ated. Automatic restart dumps are likely to be a roll-out (with a
later roll-in) of the entire job. In this case, all data space,
variables, flags, etc. would be dumped to the file system via
DBM. Restart points provided by the user are expected to be more
restricted. The user would be permitted to specify selected data
areas to be affected and to specify when such snapshots are to be
taken. These user selected restart dumps would be much more effi-
cient and cause considerably less load on the system than the auto-
matically generated dumps. In addition, the user will be per-
mitted to insert an alternate entry point in his main program (a
restart point), where appropriate arrays from the restart dumps
would be reloaded. In the initially delivered system, automatic
checkpoint restart transparent to the user will not be included.
Such a facility would be included at a later time.
1. T_opt is about 200 seconds; T_r is 0.5 seconds. At optimum, the
fraction of time lost due to restart dumps is approximately equal
to the time lost due to wasted computation, that is, computation
that ends in an abort. In a 10 minute run there are two 0.5 sec
restart dumps, plus the initial loading of the data base (total
1.5 sec) and an equal expected amount of time lost by aborting
good computation. (1.5 sec. + 1.5 sec.)/600 sec reduces net
throughput to 99.5% of what it would have been if defense against
aborts were not necessary. If no restart dump is taken during the
10 minute run, the 1.5 sec of data moving is reduced to 0.5 sec.
However, the time lost from wasted computation will triple, since
600 seconds is triple 200 seconds. Hence, the percentage of time
lost is (0.5 sec + 4.5 sec.)/600, and net throughput is 99.2%
instead of 99.5%. This is a small price to pay for the conven-
ience of not having to worry about restarts.
Since the intent of on-line spare components is to provide the
capability of maintaining the desired level of performance through
automatic reconfiguration, the system software would be able, in
most cases, to automatically restart a job whose execution may
have been interrupted by a hard failure. When the job is a
program with user-specified restart dumps and restart points, the
most recent restart dump would automatically be chosen and the
execution would resume at the restart point. For example, in a
one-hour run involving 500 time steps, it might be reasonable to
specify a restart dump every 25th time step. The restart entry
point would include the reinitialization of control variables to
states saved as part of the restart dump. The above technique is
particularly appropriate to the aero flow codes, where the computa-
tion converges.
An analysis of the effect of restart on the operation of the FMP,
using the reliability model, is contained in Section 6.2. In that
section, the assumed "restart" time of 6 minutes corresponds, not
to the T_r above, but to the total time spent at the time of
restart, including system software response to the abort, logging
of the error, reconfiguration of the system, if any, and running of
confidence tests (and possibly diagnostics as well, depending on the
type of error detected). Six minutes seems extremely generous.
6.1.8 Error Logging
Where possible, all errors are logged. The mechanism for logging
errors is via interrupt. Both the processor and the coordinator
have three classes of interrupts which can be used for logging.
One class of interrupts reports non-fatal errors (such as single-
error correction of a transient parity error detected in EM).
A second class of interrupts are the programmatic interrupts
(CALLI instruction) which can be used for calling on system
software to log errors. In many cases these may be errors
detected by tests inserted by the compiler into the code stream.
The third class of interrupts is used to log all fatal errors.
Since fatal errors involve some non-correctable situation, these
interrupts are usually directed to the coordinator. In the case
of the coordinator itself, they are directed to the diagnostic
controller and the support processor.
The design intent is to record the memory address and bit number
of bit in error (also called "syndrome") for all SECDED error
corrections and detections. It is likely that programmatic inter-
rupts will report not only the observed error condition, but also
a code which would be used to obtain a link back to the original
source code. A table in each memory holds the record of the last
N errors corrected. The size of the error log tables and the
frequency with which they are collected and reported has yet to be
determined.
6.1.9 Invariants
Applications software is one of the links in the chain that main-
tains the trustworthiness of the NASF. Although the application
software is outside of the scope of work when developing a faci-
lity, it is part of the system seen by the user and therefore must
be considered when discussing the trustworthiness of a system.
Inclusion of checks on quantities which should be invariant or in
some way well behaved during the course of the computation seems
appropriate. Examples might be:
- Total quantity of air within the mesh (as computed from the
appropriate function of geometry and pressure) should change
in accordance with air inflow and outflow at the boundary.
- Any global criterion for convergence should improve monoton-
ically for steady airflows.
- Changes in total energy in the system should correspond to
energy inflow and outflow at boundaries.
Discussions are currently under way on constructs for the language
which would make such invariant checking more convenient.
6.1.10 Diagnostics
All of the FMP shall be diagnosable. Creating diagnostics is
difficult at best, because of its interdisciplinary nature. Hard-
ware features for aiding diagnostics must be designed. The diag-
nostic programmer must be expert both in the logic design of the
machine being diagnosed, and expert in programming at the machine
dependent level. Completely automatic diagnostics for all condi-
tions is an unreasonable goal. This project would plan on
computer-assisted diagnostics.
The built-in fault detection mechanisms of the FMP have already
been discussed. In order to meet the desired goals for avail-
ability and MTTR (Mean Time to Repair) for the system, direct and
simple means for diagnosis of the system components are required.
Because of the scope of the system, direct control of diagnostics
from some central point (the diagnostic controller (DC) for in-
stance) is not realistic. A hierarchy of controls will be pro-
vided. In general, every diagnostic interface to the next level
of detail in the system is expected to have a mode of operation
which allows the outer level direct control over setting and
observing any state (bits) in the immediate next level of detail.
For example, the logic in the diagnostic controller (DC) would be
tested (set DC state including command, run the DC clock, observe
the DC state) by the support processor. The diagnostic control-
ler, in turn, tests the state control logic of the coordinator in
the same way. The coordinator tests its own memory and the state
control logic of each of the processors. The processors and
coordinator jointly check the Connection Network and the EM
modules. The processors will also be able to check each other.
The coordinator also tests the state control logic of the Data
Base Memory controller. The DBM controller then tests the rest of
the DBM including the path to the File System.
Figure 6.2 shows the layered structure of the off-line diagnos-
tics. Layer 1 is the initial phase of the "hard core", when the
Support Processor is learning to trust the command-accepting
portion of the DC. Layer 2 is the rest of the "hard core", also
imposed by a Support Processor program, checking out the DC and
enough of the coordinator so that the coordinator can be trusted
to execute successfully. Layer 3 runs on the coordinator and
exercises that portion of the FMP to which the coordinator has
direct access. Layer 4 consists of those portions of the FMP to
which the coordinator has only indirect access. The coordinator
must cause the DBM controller and the processor to execute certain
operations in order to get these portions exercised. Some
diagnostics for layer 4 run on the DBM controller and the
processors as an array.
This on-line form of diagnostics is used as needed to isolate or
confirm an error to a replaceable unit (such as a processor). At
that point the system is reconfigured, checked and execution
resumes. If the automatic diagnostics are unable to confirm the
location of a fault, the same controls are accessible to the main-
tenance personnel who can develop custom test sequences as
required. Note that when the system successfully detects a fail-
ure, isolates it to a system component, swaps in a spare compon-
ent and resumes execution without requiring manual intervention,
the system is defined to be continuously available. Only when
manual intervention is required to isolate a problem and restart
the system is the system considered to be unavailable.
Once a failure has been isolated to a system component, such as a
particular processor, and that component has been switched "off-
line", isolation of the bad component can proceed concurrently
with the resumption of execution of the FMP. These off-line tests
would consist of two types. Some tests will be possible with the
system component still attached to the system. These tests would
allow "in-situ" testing without disturbing the environment in
which the error occurred. Thus, spare system components will be
capable of access to other spare components without disruption of
the on-line portion of the FMP.
In addition to the above test modes, test equipment is expected to
be available to test the removable system components away from the
system.
Figure 6.2 FMP Block Diagram with Diagnostic Layers Superimposed
[Block diagram. Labeled elements: processors PROC. 1, PROC. 2
through PROC. 512, each with a CN buffer and EU; Connection Network
(CN); Extended Memory modules through EM521; Data Base Memory with
DBM controller and path to file memory; Coordinator; Diagnostic
Controller (DC); Support Processor System (SPS). Labeled data
rates: 2.2 x 10^9 bits/sec and 2.8 x 10^11 bits/sec (cabling
bandwidth).]
6.1.10.1 Level of Performance
One should be aware that there is no single magic date on which
the diagnostics will be "finished". The delivery date for the
diagnostic software will merely mark a time at which the diagnos-
tics achieved have a useful level of accuracy. On that date,
there will be still room for improvement in the diagnostics.
Diagnostic programs should continue to improve as operating
experience shows up unanticipated failure modes and shows the
areas in which the accuracy of the diagnostics can be improved.
The goal is to achieve the highest possible uptime with the least
amount of time lost to either downtime or lost to running diagnos-
tics. It will be important to continue to fund diagnostic develop-
ment at a modest level of effort for the life of the NASF in order
to continually improve the efficiency of the support operations
and to reflect the design updates and changes which are a normal
part of the life of any system.
The initial capabilities of the automatic diagnostics system have
yet to be defined. The automatically executed diagnostics would
detect X% of all possible failures. The goal of this set of
automatic diagnostic programs is to isolate faults to the least
replaceable unit (LRU) at the FMP level (coordinator or CN card,
processor, EM module, ...). The diagnostics shall locate the
failure to a single LRU Y% of the time, and shall locate the
failure to within N LRUs Z% of the time. When a failure could be
either on the backplane or on an LRU, the probability of
determining whether it is on the LRU itself or in the backplane
behind the LRU will drop to W%.
The off-line LRU diagnostics (tester programs) shall localize
failures to the chip, or to some number of chips, with similar
percentages. U% of the failures shall be found, V% shall be
localized to within N chips, and T% shall be localized to within
one chip or component. All of the above percentages need to be
determined.
6.1.10.2 NASF Computer-Assisted Diagnostic Tools
A diverse set of diagnostics will be implemented for the NASF.
6.1.10.2.1. Support Processor System Diagnostics. The Maintenance
Diagnostic Unit (MDU), a separate execution unit of the Support
Processor system, can impose diagnostic operation on any off-line
elements of the Support Processor. The MDU can write information
into any flipflop of the Support Processor, cause the unit to
execute any number of clocks, and then read the state of any
flipflops. Results are then compared to precomputed results.
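The write, clock, read and compare cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the 4-bit counter standing in for a unit, the stuck-at-zero fault model, and the names `good_step`, `stuck_at_zero` and `run_scan_test` are hypothetical and do not come from the MDU design.

```python
from typing import Callable, List

State = List[int]  # one entry (0 or 1) per flipflop

def good_step(state: State) -> State:
    """Hypothetical unit logic: a binary counter, least significant bit first."""
    value = sum(bit << i for i, bit in enumerate(state))
    value = (value + 1) % (1 << len(state))
    return [(value >> i) & 1 for i in range(len(state))]

def stuck_at_zero(step: Callable[[State], State], position: int) -> Callable[[State], State]:
    """Wrap a step function so one flipflop always holds 0 (assumed fault model)."""
    def faulty(state: State) -> State:
        nxt = step(state)
        nxt[position] = 0
        return nxt
    return faulty

def run_scan_test(step: Callable[[State], State], initial: State,
                  clocks: int, expected: State) -> bool:
    """Write the initial state, apply `clocks` clocks, read back and compare."""
    state = list(initial)
    for _ in range(clocks):
        state = step(state)
    return state == expected

# The expected state is precomputed from the fault-free model: 5 clocks from zero.
EXPECTED_AFTER_5 = [1, 0, 1, 0]
healthy = run_scan_test(good_step, [0, 0, 0, 0], 5, EXPECTED_AFTER_5)
broken = run_scan_test(stuck_at_zero(good_step, 0), [0, 0, 0, 0], 5, EXPECTED_AFTER_5)
```

Here `healthy` is true and `broken` is false, which is the kind of mismatch against precomputed results that the comparison step relies on.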
6.1.10.2.2. Support Processor Peripheral Exercisers. Programs
resident on the Support Processor exercise the peripherals of the
Support Processor.
6.1.10.2.3. FMP Off-line Diagnostics. These diagnostics are used
when the FMP is not executing user programs and is considered
"off-line" in terms of production commitments. These diagnostic
procedures execute throughout the FMP depending on their purpose.
The "hard core" of these diagnostic procedures is a program
resident on the Support Processor exercising the FMP via the DC.
During early debugging, the DC will be available before the
diagnostics have been written, and some diagnostic capability will
exist by controlling the DC manually from the maintenance console.
After the coordinator has been diagnosed (or after confidence has
been gained in the coordinator), most of the rest of the FMP
diagnostics will run in the coordinator. These run much faster
than DC diagnostics do. The analysis portion of all these
diagnostics runs in the Support Processor. The coordinator will
check the viability of each of the EM module controls. The EM
modules will be exercised in detail as part of the CN test.
Each processor will check its memory. CN diagnostics require the
execution of EM accesses from a number of processors acting in
concert. The CN diagnostics therefore occupy the entire array,
just as does a user program.
6.1.10.2.4. Off-line LRU Diagnostics. These diagnostic programs
execute on the test equipment. Every LRU can be diagnosed to the
chip level, or exercised with sufficient flexibility that the
technician can diagnose to the chip level. The number of
different types of testers which may be required is yet to be
determined. All testers are expected to be program compatible
with each other, so that one language creates tests for all of
them. That test generation language would be linked to the design
data base.
6.1.10.2.5 PAL (Programming Aid for Logicians). PAL is the
language in which simple tests can be written on-the-spot for
execution by the DC, or for execution on the B7800 for exerting
control over the array via the DC. Eventually, the PAL programs
would form a library that would continue to be useful after
delivery, especially for the small residue of failure modes which
the automatic diagnostics do not adequately support.
6.1.10.2.6 Analysis of Logged Errors. Tables which contain the
error logs would be periodically collected and provided to a
program which analyzes and summarizes the error activity in these
logs. This program would execute on the Support Processor.
6.1.10.3 CN Diagnostics
The Connection Network represents a design which is novel when
compared with previously existing circuitry for which diagnostics
have been generated. The diagnostic approach described below
would allow the FMP to isolate faults to the bit and node within
the CN. Since the approach is a successive-refinement technique,
some savings may be gained by stopping the FMP automatic diag-
nostic at the board level (the replaceable unit) and isolating the
failed chip using off-line test equipment.
6.1.10.3.1 Assumptions and Requirements. The following
features of the FMP design, and of the CN portion of it, are the
basis for what follows.
- SECDED is checked on the data. The checking is performed
in processor or coordinator.
- Parity is checked on the addresses and operation codes sent
from processor or coordinator to a single EM module.
- The Omega network, from which the CN design is derived, has
one and only one path between a port on one side and a port
on another, so that when an error is detected, the path
through the network taken by that erroneous data is known.
- When processor number and EM module number are known, the
operating system can translate these numbers into CN port
number on the processor side and CN port number on the EM
side. In general, errors will be reported by processor
number and EM module number, whereas the diagnostics need to
know physical CN port numbers. This translation needs to be
done not only for CN diagnostics, but also for processor and
EM module diagnostics as well.
6.1.10.3.2 Localizing a Hard Error in the CN. The analysis of
the CN starts with the analysis of the Omega network. The
argument will then be expanded to the more complex case of the
actual CN. Between port n on one side and port m on the other
side, there is a fixed path. All traffic between these two ports
takes the same path. Between some other ports n' and m' there is
also a fixed path. None or some of the nodes on the path n'-m'
are the same as the nodes on the path n-m. Inspection of the four
binary numbers n, m, n', and m', bit by bit, will disclose in
which of the ten levels of logic in the CN these paths have common
nodes. By choosing two paths n-m and n'-m' which have some nodes
in common, and finding that the same error occurs in data
traversing both such paths, we localize the fault to those nodes.
Given a particular path from m to n, and knowing which contiguous
levels of the ten logic levels we want to include in some other
path, we make m' different from m for all those bits corresponding
to levels on the left side of the Omega for which the path is not
to be identical, and we make n' different from n for all those
bits corresponding to levels on the right side of the Omega for
which the path is not to be identical.
Presumably the diagnostics will be written using a binary search
strategy. First we run tests in which the faulty path and the
other path have four nodes in common, then tests in which they
have two nodes in common, then one.
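The bit-by-bit comparison and the halving strategy can be sketched for a textbook single Omega network. The code assumes 2x2 switch nodes and ten levels (1,024 ports per side), a single hard fault at one node, and a test oracle `path_fails` supplied by the caller. The function names are illustrative, and the actual CN differs from this simplified model.

```python
BITS = 10  # ten node levels, so 2**10 ports per side (assumes 2x2 switch nodes)

def omega_path(src, dst, bits=BITS):
    """Return the switch index visited at each level (level 1 first).

    Textbook Omega routing: before each level the line number is rotated
    left by one bit (perfect shuffle), and its low bit is then replaced by
    the next destination bit, most significant bit first.
    """
    mask = (1 << bits) - 1
    path = []
    for level in range(1, bits + 1):
        line = ((src << level) | (dst >> (bits - level))) & mask
        path.append(line >> 1)  # two adjacent lines share one 2x2 switch
    return path

def common_levels(path_a, path_b):
    """Levels (1-based) at which two paths pass through the same switch."""
    return [lvl for lvl, (a, b) in enumerate(zip(path_a, path_b), start=1) if a == b]

def probe_ports(src, dst, first, last, bits=BITS):
    """Ports whose path shares exactly levels first..last with the src-dst path.

    Flipping one source bit makes the early levels differ; flipping one
    destination bit makes the late levels differ.
    """
    return src ^ (1 << (bits - first)), dst ^ (1 << (bits - last))

def localize_level(src, dst, path_fails, bits=BITS):
    """Binary search for the faulty level on a path known to fail.

    path_fails(s, d) is the caller's test: True if data sent from port s
    to port d shows the same error. A single faulty node is assumed.
    """
    lo, hi = 1, bits
    while lo < hi:
        mid = (lo + hi) // 2
        probe_src, probe_dst = probe_ports(src, dst, lo, mid, bits)
        if path_fails(probe_src, probe_dst):
            hi = mid       # fault lies in the shared levels lo..mid
        else:
            lo = mid + 1   # fault lies in the levels not shared
    return lo
```

For example, `common_levels(omega_path(0, 0), omega_path(64, 8))` yields levels 4 through 7: the source ports differ in a bit that affects only levels 1 to 3, and the destination ports differ in a bit that affects only levels 8 to 10.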
In the preferred CN version, there are two Omega networks, not
one, with the result that the path is unique only to within a pair
of nodes (one in the upper Omega, one in the lower Omega) at each
point. Two paths that must intersect in the simple Omega can pass
each other without using the same gates, if one uses the node in
the upper Omega network and the other uses the node in the lower
Omega network. The CN would be designed to inhibit this
redundancy. If the two Omega networks communicate only at the
ports (a version that was simulated on the CN simulator), we use a
diagnostic control that disables either the upper or the lower
Omega while the other one is exercised.
If the two Omega networks allow paths to be connected between
upper and lower network at each pair of nodes, then diagnostic
disable/enable controls are needed on both Omegas at all ten node
levels, twenty such signals in all, so that at each node level one
can force a path to stay in the same (upper or lower) Omega net-
work, or force it to jump (from upper to lower or vice versa).
With these controls, all paths can be exercised under the same
diagnostic scheme as described for a single Omega.
The error detection used by the diagnostics is the SECDED check on
words that have been sent through the CN in one direction or the
other, and the parity check on addresses and commands sent from
processors to EM. Now a given SECDED error could be due to an
error on the path to EM during a write, or due to an error in the
EM module itself, or could be due to an error on the path from EM
to processor. The diagnostics must distinguish between these
several cases. A test on EM module M consists of writing into the
possibility of faults in the CN). When the memory is checked out,
the diagnostics can tell apart the two directions in the CN by sending
data between the EM module and several different processors. To
make sure the failure is not a write failure if the read appears
to fail, write commands to the EM would be generated from several
processors. Likewise, redundant reading is used to check for
write failures.
The diagnostics must detect the case that the EM port number is
being erroneously interpreted in an EM access request so that the
fault also causes one to traverse the wrong path. This case is
detected by the parity check at the EM module which covers module
number as well as address and opcode.
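One way a parity bit that covers the module number can expose a misrouted request is sketched below. The field layout, the choice of even parity, and the idea that the receiving module folds in its own number rather than a transmitted one are all assumptions made for illustration; the report does not specify these details.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1

def make_request(module: int, address: int, opcode: int) -> dict:
    """Sender side: parity is computed over module number, address and opcode."""
    bit = parity(module) ^ parity(address) ^ parity(opcode)
    return {"address": address, "opcode": opcode, "parity": bit}

def check_at_module(own_module: int, request: dict) -> bool:
    """Receiver side: recompute parity using the receiving module's own number."""
    expected = (parity(own_module) ^ parity(request["address"])
                ^ parity(request["opcode"]))
    return expected == request["parity"]

request = make_request(module=5, address=0x1234, opcode=3)
arrives_correctly = check_at_module(5, request)   # intended module
arrives_misrouted = check_at_module(4, request)   # one routing bit wrong
```

A single parity bit of this kind catches a module number that is wrong in an odd number of bits; a two-bit routing error (module 6 instead of 5 in this sketch) would pass the check unnoticed.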
6.1.10.3.3 Diagnostic Generation Scheduling. The schedule for
creating diagnostics is constrained by the requirements of
fabrication, debugging, and system integration. The first facility
needed is the test equipment, with enough of the test equipment
software written to facilitate manufacturing acceptance testing of
the LRUs as they are built. The first LRUs built are the proces-
sor and the DC boards. Fabrication generally follows the same
sequence as the diagnostics: the DC is completed first, the
coordinator is completed before the last processor is plugged in,
EM integration (including the CN) follows successful processor
operation, and DBM is the last item to integrate. However,
because of the number of processors involved, processors must be
among the first components fabricated. On this basis, we see that
the sequence of creation of the diagnostics is:
1. Tester and tester programs start first. The first tester
programs written are for DC boards and processor.
2. Processor on-line diagnostics to run on the processor
while the processor is on the tester. This is an early
version of the same processor diagnostic test used for FMP
automatic diagnostics.
3. The PAL assembler. This is used to generate tests
on-the-spot by the debugging logicians as they debug the
coordinator, the fanout boards, and the DBM controller.
4. On-line diagnostic tests are used to verify proper design
and operation of the FMP. The on-line tests are used as part
of the acceptance tests.
6.2 RELIABILITY, AVAILABILITY AND MAINTAINABILITY
The efforts in reliability, availability and maintainability
during this study addressed the following key areas:
• The effects of redundancy and parts quality on the
FMP reliability.
• An updated and refined reliability and availability
analysis of the FMP and the NASF system
• An estimate of the maintenance manpower required to
support the FMP
The redundancy study showed that the use of redundant processors
and extended memory modules and redundancy in the data base memory
provided significant improvements in FMP availability and especial-
ly mean up time (MUT). The level of redundancy studied is now
incorporated in the FMP architecture presented in Chapter 5. The
use of B-2 quality level components versus C level quality was
also shown to make a significant improvement (B-2 and C level
component quality represent levels of quality resulting from dif-
ferent degrees of testing and screening; discussed in more detail
in Section 6.2.3). The refined predictions of FMP (and NASF)
reliability are based on the incorporation of these conclusions.
The results of the refined reliability analysis of the NASF are
summarized in Table 6.3, which presents mean up time, mean down
time and availability of the three major subsystems of the NASF.
The refined reliability analysis of the FMP considered a range of
failure rates for the LSI memories, as well as a range of
improvement resulting from the application of SECDED, a range of
intermittent or "soft" failure rates, and a range of efficiencies
for recovery from interruptions. The results of this analysis
provide three levels of reliability for the FMP: a lower bound (or
worst probable case), a probable case and an upper bound (or best
probable case). The results shown in Table 6.3 are for the
probable case.
TABLE 6.3
NASF AVAILABILITY ANALYSIS
                              FILE          SUPPORT
                              MANAGEMENT    PROCESSOR
                 FMP          SUBSYSTEM     SUBSYSTEM     COMPOSITE
MEAN UP TIME     14.9 HRS     19,310 HRS    263.0 HRS     14.1 HRS
MEAN DOWNTIME    0.14 HRS     1.9 HRS       0.68 HRS      0.17 HRS
AVAILABILITY     0.9904       .9999         .9974         .9880
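The availability row of Table 6.3 agrees, to about three decimal places, with the usual steady-state relation A = MUT / (MUT + MDT). The sketch below assumes that relation; the report's own programs (Appendix D) may compute availability differently, and the tabulated times are rounded.

```python
def availability(mut_hours: float, mdt_hours: float) -> float:
    """Steady-state availability from mean up time and mean down time."""
    return mut_hours / (mut_hours + mdt_hours)

# (mean up time, mean down time, tabulated availability) from Table 6.3
TABLE_6_3 = {
    "FMP": (14.9, 0.14, 0.9904),
    "File management subsystem": (19310.0, 1.9, 0.9999),
    "Support processor subsystem": (263.0, 0.68, 0.9974),
    "Composite": (14.1, 0.17, 0.9880),
}

def largest_discrepancy(table=TABLE_6_3) -> float:
    """Largest absolute difference between computed and tabulated availability."""
    return max(abs(availability(mut, mdt) - tabulated)
               for mut, mdt, tabulated in table.values())
```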
An examination of the FMP reliability analysis results reveals
that the connection network (with no redundancy) has a significant
impact on the FMP failure rate. Similarly the reliability of the
data base memory becomes a determining factor in the FMP
reliability if only the lowest probable SECDED improvement factors
are achieved and the failure rates of the LSI memory circuits
(256K) are no better than that assumed for the worst probable case.
Future efforts regarding the FMP should give careful consideration
to these two areas to achieve optimum FMP reliability.
The maintenance analysis revealed that to have a 95% confidence of
meeting the required repair and maintenance actions of the FMP for
any given seven day week, a minimum of 13 maintenance personnel
working five shifts each must be available (65 8-hour shifts).
Assuming 21 shifts per week (3 shifts per day x 7 days per week),
the maintenance of the FMP will require an average of 3 persons
per shift, excluding operators, administrative and supervisory
personnel.
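The staffing arithmetic above can be checked directly. The function name and parameter names are illustrative only.

```python
def average_staff_per_shift(people: int, shifts_per_person: int,
                            shifts_per_day: int = 3, days_per_week: int = 7):
    """Return (person-shifts per week, calendar shifts per week, average per shift)."""
    person_shifts = people * shifts_per_person
    calendar_shifts = shifts_per_day * days_per_week
    return person_shifts, calendar_shifts, person_shifts / calendar_shifts

person_shifts, calendar_shifts, average = average_staff_per_shift(13, 5)
```

With 13 people working five shifts each, this gives 65 person-shifts spread over 21 calendar shifts, an average slightly above 3 persons per shift.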
6.2.1 Reliability/Availability Model
A generalized systems model for predicting the reliability and
availability for a computer system includes many elements. Figure
6.3 describes this general model for NASF. There are five major
elements: facility, personnel, software, hardware and miscel-
laneous. While all of these elements impact the ultimate system
availability, the analysis and predictions conducted at this time
consider only the hardware and some interruptions of a "soft" or
intermittent nature contributed by the other elements.
This model, together with the reliability block diagrams of NASF
elements presented later in this chapter, illustrates the
interdependency of the subelements that contribute to the
reliability of the system under consideration.
Figure 6.3 General Reliability/Availability Systems Model for NASF
[Diagram. Legible labels include software, hardware, building,
operations, input power, maintenance, environmental control, users,
and a marking of the NASF elements considered.]
The NASF hardware includes five elements: (1) the FMP; (2) the
file management subsystem; (3) the support processor subsystem;
(4) the data communication subsystem; and (5) the test and
maintenance equipment. The data communication subsystem consists
of a large number (over 100) of terminal interfaces, modems,
networks and I/O devices. The data communication processors are
included in the support processor subsystem. Failure of any one of
the devices or interfaces in the data communication subsystem has
no impact on the availability of the NASF for the other devices
and does not impact the availability of the rest of the system.
Therefore the data communication subsystem portion of the NASF
hardware was excluded from the study.
The availability of the test and maintenance equipment can be
adjusted to a level, through the use of redundant equipment, that
will have little impact on the overall system availability. The
remaining hardware elements of the NASF, (1) the FMP, (2) the file
management subsystem and (3) the support processor subsystem, are
addressed in this analysis. Detailed models (reliability block
diagrams) of each of these elements are provided later in this
chapter.
Programs developed by the Burroughs Corporation to aid in
designing fault-tolerant computers were used with the above models
to determine the system/subsystem
reliability/availability/maintainability. Details of these
programs and definitions of terms are included in Appendix D.
6.2.2 Redundancy Study
The FMP architecture consists of parallel elements in a number of
areas. Parallelism readily permits the use of redundancy for
improving availability. Redundancy, however, can also increase
equipment and maintenance cost, increase failure rate, and
frequently increases software and operations complexity. An
analysis was conducted to compare the effects of the application
of redundancy to the FMP in three areas: (1) the processors,
(2) extended memory, and (3) data base memory. These areas
represent three of the major areas of the FMP.
Calculations of the mean up time (MUT), mean down time (MDT), and
availability (A), of the FMP were made with various combinations
of redundancy. The level of redundancy used is that discussed in
Chapter 5. This includes 4 on-line processors resulting in a
redundancy of 128 required out of 129 processors available in each
of 4 processor bays; 4 on-line extended memory modules resulting
in 130 out of 131 extended memory modules for 3 of the 4 extended
memory bays and 131 out of 132 extended memory modules in the
fourth bay; and a partitioning of the data base memory into 4
sections of which any 2 are required for the FMP to be available.
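The benefit of a "k required out of n available" arrangement can be illustrated with the standard binomial formula. This sketch assumes identical, independent units with a fixed per-unit availability; the value 0.999 is hypothetical and is not taken from the report, and the report's own programs (Appendix D) model repair and recovery in more detail.

```python
from math import comb

def k_of_n_availability(k: int, n: int, unit_availability: float) -> float:
    """Probability that at least k of n independent, identical units are up."""
    a = unit_availability
    return sum(comb(n, up) * a**up * (1 - a)**(n - up) for up in range(k, n + 1))

UNIT = 0.999  # hypothetical per-processor availability, for illustration only

bay_without_spare = k_of_n_availability(128, 128, UNIT)  # all 128 must be up
bay_with_spare = k_of_n_availability(128, 129, UNIT)     # 128 required of 129
dbm_sections = k_of_n_availability(2, 4, UNIT)           # any 2 of 4 sections
```

Under these assumptions a single spare per bay raises the bay figure from roughly 0.88 to above 0.99, which is the qualitative effect the redundancy study examines.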
Table 6.4 shows the results of this reliability and availability
analysis for eight different combinations of redundancy (listed as
Cases 1 through 8). The power of redundancy in improving mean up
time and availability can be seen by reviewing these results, which
are summarized in Figure 6.4. The use of redundancy in only one
area makes only a modest improvement in the mean up time and
availability. Use of redundancy in two areas increases the mean
up time and availability somewhat more. The use of redundancy in
all three areas results in a significantly higher mean up time and
availability.
TABLE 6.4
Effect of Redundant Elements on FMP Reliability
            REDUNDANT ELEMENTS
                  EXTENDED  DATA BASE  MEAN UP TIME  MEAN DOWN TIME
CASE  PROCESSORS  MEMORY    MEMORY     (HOURS)       (HOURS)         AVAILABILITY
1     NO          NO        NO         10.2          0.65            .9403
2     YES         NO        NO         15.6          0.43            .9730
3     NO          YES       NO         12.7          0.25            .9449
4     NO          NO        YES        16.2          0.72            .9575
5     YES         YES       NO         22.3          0.51            .9878
6     YES         NO        YES        36.2          0.33            .9908
7     NO          YES       YES        23.5          0.92            .9622
8     YES         YES       YES        117.9         0.51            .9956
It should be pointed out that the data base used for these
calculations does not include all the factors used in the analysis
reported elsewhere in this chapter. The results presented in this
section should only be used for ascertaining the sensitivity of
the FMP reliability and availability to redundancy.
The conclusion of this study was that the application of
redundancy to these three areas, to the extent defined, represents
a significant improvement in FMP reliability and availability. The
predicted reliability and availability values for the FMP and NASF
presented in this chapter are predicated on the use of this
redundancy.
Figure 6.4 Effects of Redundancy on FMP Mean Up Time
[Plot of mean up time (hrs, scale 0 to 120) against the number of
FMP subsystems for which redundancy is applied, one point per case.
Cases: 1, no redundancy; 2, redundancy in processor; 3, redundancy
in extended memory; 4, redundancy in data base memory; 5, redundancy
in processor and extended memory; 6, redundancy in processor and
data base memory; 7, redundancy in extended memory and data base
memory; 8, all three areas.]
6.2.3 Component Quality Study
Various levels of component quality are available for fabricating
electronic systems. These levels are achieved through the
application of certain screening and testing procedures as called
out in various government specifications and statements. Levels
most likely to be considered for the FMP are B-2 and C. The B-2
level represents the vendor's equivalent of a number of these
screening and testing procedures including a 168 hour burn-in. The
C level has less stringent tests and no burn-in but is done
specifically to the government specifications (see reference [9]
for more information).
The effect on FMP reliability of using B-2 level versus C level
quality components was investigated. Table 6.5 presents these
results. Four cases were analyzed. The quality level was varied
for the FMP with no redundancy and with the level of redundancy
presented in Section 6.2.2 above. It is noted that the higher the
mean up time, the greater the impact of component quality. The
conclusion of this study is that if a high reliability in terms of
mean up time is desired, higher quality (B-2 level) components
should be used. The predictions of the FMP and NASF reliability
and availability presented later in this chapter are predicated on
the use of B-2 level quality components.
Table 6.5
Effects of Component Quality on FMP Reliability
      QUALITY              MEAN UP TIME  MEAN DOWN TIME
CASE  LEVEL    REDUNDANCY  (HOURS)       (HOURS)         AVAILABILITY
1     B-2      NONE        10.2          0.65            .9403
2     C        NONE        8.9           0.68            .9296
3     B-2      YES         117.9         0.51            .9956
4     C        YES         75.0          0.51            .9932
6.2.4 FMP Reliability and Availability Prediction
Since the FMP is the most complex element of the NASF hardware and
since the concept under consideration involves highly
state-of-the-art technologies, a more detailed analysis has been
conducted on this element. As described earlier, a number of
factors have been considered in the FMP analysis. The values of
these factors are varied over a range to provide an upper and
lower bound as well as probable values for the reliability and
availability. Figure 6.5 shows the reliability/availability block
diagram used for the FMP. In addition to the redundancy shown, a
B-2 quality level (in accordance with MIL-HDBK-217B [9]) was
assumed for the integrated circuits and a 6 minute recovery time
was assumed for manual operator restart. The mean times to repair
(MTTR) are based on past experience and the estimated complexity
of isolating and correcting a failure in the various elements.
Figure 6.5 points out the major redundant elements of the FMP. It
should be noted that no redundancy is shown in the connection
network. The reliability/availability analysis assumes a single
layer network. The connection network presented in Chapter 5 is a
double layer network. A double layer provides some unknown level
of redundancy, since one of the purposes of the double layer is to
provide alternate paths where blocking occurs between the
processors and extended memory. At least some failures will appear
as blocking to the network. Therefore some degree of redundancy
(or fault tolerance) is available in the double layer network.
Since the degree of redundancy from the double layer network
cannot be identified and taken into consideration in these
analyses, a single layer network and the component count of a
single layer network were assumed.
The failure rates of the individual FMP elements were determined
by using a tentative parts list for each element. The quantity
and failure rate of each component are then applied in
straightforward calculations which result in the element failure
rate (or mean time between failures). Appendix E contains the figures
listing the data and the resulting element failure rates. The
failure rates of these elements and their estimated mean time to
repair are then used with the DESIGN Program, described in
Appendix D, along with other factors to be described, to predict
the FMP reliability and availability.
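The parts-list calculation can be sketched as a sum of quantity times per-part failure rate, with the mean time between failures taken as the reciprocal. This assumes constant failure rates and that any part failure fails the element; the parts list shown is hypothetical and is not the data of Appendix E.

```python
def element_failure_rate(parts) -> float:
    """Sum quantity x rate over a parts list.

    parts: iterable of (quantity, failures per million hours per part).
    Returns the element failure rate in failures per million hours (FPMH).
    """
    return sum(quantity * fpmh for quantity, fpmh in parts)

def mtbf_hours(fpmh: float) -> float:
    """Mean time between failures in hours for a rate given in FPMH."""
    return 1_000_000 / fpmh

# Hypothetical parts list for one element: (quantity, FPMH per part)
EXAMPLE_PARTS = [(100, 0.05), (16, 0.32), (4, 0.5)]

example_rate = element_failure_rate(EXAMPLE_PARTS)  # in FPMH
example_mtbf = mtbf_hours(example_rate)             # in hours
```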
Not all of the factors that impact the reliability and
availability of the FMP can be readily delineated. Four factors
were selected for which a range of values could be projected and
used for the FMP reliability predictions. These four factors,
which are discussed in the following sections, are:
(1) LSI Memory Failure Rate
(2) SECDED Improvement Factors
(3) Ratio of Permanent Failures to Intermittent Failures
(4) Recovery Efficiency
Figure 6.5 FMP Reliability/Availability Block Diagram
[Block diagram. Blocks shown: coordinator, coordinator memory,
fanout tree, extended memory fanout, processor modules (128 of 129
per group), connection network, extended memory modules (130 of
131, and 131 of 132 in one group), data base memory control, data
base memory bays, and power supplies.]
6.2.4.1 LSI Memory Failure Rates
Actual field data on LSI memory failure rates is relatively
sparse. Some data is available on 16K devices [8]. Reliability
models such as those in MIL-HDBK-217B [9] for predicting device
failures generally do not hold for significant increases in
complexity and density. A worst probable case (lower bound) failure
rate may be assumed by using the conservative estimate of .4
failures per million hours (FPMH) for a 16K device, which is
equivalent to the failure rate of four 4K devices (which have an
accepted failure rate of .1 FPMH). Using this same philosophy,
lower bound failure rates for the 64K and 256K devices were set at
1.6 FPMH and 6.4 FPMH respectively.
For an upper bound (best probable case), a value of .1 FPMH was
set for all three memory devices. Curves showing the improvement
of MOS memory device failure rates with maturity tend to be
asymptotic to a value in the range of .1 FPMH regardless of the
density [8].
The most probable failure rate was determined using the model in
MIL-HDBK-217B for the 16K device and then doubling that value for
each quadrupling of memory sizes. This process results in failure
rates of the 16K device being .32 FPMH, the 64K device being .64
FPMH and the 256K device being 1.28 FPMH.
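The three sets of memory failure rates follow a simple scaling rule per quadrupling of density: a factor of 4 for the worst probable case, 2 for the most probable case, and 1 (no growth) for the best probable case. The helper below reproduces the quoted values; its name and structure are illustrative only.

```python
DENSITIES = ["16K", "64K", "256K"]  # each step is a quadrupling of density

def scaled_rates(rate_16k_fpmh: float, factor_per_quadrupling: float) -> dict:
    """Failure rate (FPMH) per device density, starting from the 16K value."""
    return {density: rate_16k_fpmh * factor_per_quadrupling**step
            for step, density in enumerate(DENSITIES)}

worst_probable = scaled_rates(0.4, 4)   # rate grows in proportion to bit count
most_probable = scaled_rates(0.32, 2)   # rate doubles per quadrupling
best_probable = scaled_rates(0.1, 1)    # rate independent of density
```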
6.2.4.2 SECDED Improvement Factor
Improvements in reliability of the FMP are made through the appli-
cation of Single-Bit Error Detection and Correction and Double-Bit
Error Detection (SECDED) in the FMP memories. The mathematical
model discussed in Appendix B of reference [2] determined that
gains could vary from a lower bound (worst probable case) of 2 to
an upper bound (best probable case) of 164 for 16K, 327 for 64K
and 653 for 256K memory packages. These two bounds represent the
extremes of the probable SECDED improvement. It is anticipated
that the real value will fall somewhere within this range. For
the purpose of this analysis a value of 50 has been selected as
being the most probable SECDED improvement factor.
The SECDED improvement factor is applied to the reliability
analysis by direct division of the memory devices' failure rates
by the improvement factor. Note, however, that applying the
improvement factor to the memory circuits alone does not consider
that SECDED also corrects transient errors that may occur from
other sources. For example, transient single bit errors occurring
in the connection network, or due to software errors, or due to
noise problems in data being transmitted to a memory, may be
corrected through SECDED.
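The direct division can be illustrated for the 256K device. The worst probable pairing (6.4 FPMH with a factor of 2) is the one the report cites for the data base memory; the pairing of the most probable and best probable rates with the factors 50 and 653 is an assumption made here for illustration.

```python
def effective_fpmh(device_fpmh: float, secded_factor: float) -> float:
    """Device failure rate after dividing by the SECDED improvement factor."""
    return device_fpmh / secded_factor

# 256K device: (failure rate in FPMH, SECDED improvement factor)
CASES_256K = {
    "worst probable": (6.4, 2),
    "most probable": (1.28, 50),   # pairing assumed for illustration
    "best probable": (0.1, 653),   # pairing assumed for illustration
}

effective_256k = {name: effective_fpmh(rate, factor)
                  for name, (rate, factor) in CASES_256K.items()}
```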
6.2.4.3 Ratio of Permanent Failures to Intermittent Failures
Burroughs field data has shown that the ratio of the mean time
between permanent failures (MTBF(P)) to the mean time between
intermittent failures (MTBF(I)) is estimated to vary over the
range of about 10 to 1 to 1 to 1. These values have been selected
for the lower and upper bounds and the ratio of 5 to 1 selected
for the most probable case. The value of 5 to 1 for
MTBF(P)/MTBF(I) corresponds to the assumption that 5 out of 6
failures are due to intermittents.
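The step from a 5 to 1 ratio to "5 out of 6 failures" follows from treating each failure rate as the reciprocal of its MTBF, so that the intermittent share is r / (r + 1) for r = MTBF(P)/MTBF(I). A short sketch, assuming constant failure rates:

```python
from fractions import Fraction

def intermittent_fraction(mtbf_p_over_mtbf_i) -> Fraction:
    """Share of all failures that are intermittent.

    With rates 1/MTBF(P) and 1/MTBF(I), the intermittent share is
    (1/MTBF(I)) / (1/MTBF(I) + 1/MTBF(P)) = r / (r + 1).
    """
    r = Fraction(mtbf_p_over_mtbf_i)
    return r / (r + 1)

lower_bound = intermittent_fraction(10)   # 10 to 1
most_probable = intermittent_fraction(5)  # 5 to 1
upper_bound = intermittent_fraction(1)    # 1 to 1
```

The three cases give 10/11, 5/6 and 1/2 respectively.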
6.2.4.4 Recovery Efficiency
The FMP, like other large systems, should be able to recover
automatically from intermittent failures and in some cases
permanent failures. The system recovery should be designed with
the goal of being 100% efficient, that is to say that 100% of the
time after an interruption the system automatically reconfigures
and restarts with negligible downtime. Unfortunately most systems
do not achieve this idealized goal. Experience shows that recovery
efficiency ranges from 70% to almost 100%. These levels (70% and
100%) were selected as the lower and upper bounds and a level of
80% selected for the predicted level.
6.2.4.5 FMP Reliability Analysis Results
The values of the various factors discussed above were used as
inputs to the FMP model and reliability analysis program. Figures
6.6, 6.7 and 6.8 present the input data and the calculated results
of this analysis for the lower bound (worst probable case), the
probable case and the upper bound (best probable case). The input
data include the following:
1) Name: Abbreviated name of an FMP element
2) R: Minimum number of elements required for
FMP to be available
3) N: Number of identical elements available
4) MTBF(P): Mean time between permanent failures
5) MTBF(I): Mean time between intermittent failures
6) SPFM: Single point failures (not used in this
analysis)
7) DRT: Device repair time
8) SRT: Single point repair time (not used in this
analysis)
9) RE(P): Recovery efficiency from permanent type
failures
10) RE(I): Recovery efficiency from intermittent type
failures
11) DMRT: Device manual recovery time (assumed to be
.1 hours for the FMP)
12) MTBME: Mean time between maintenance errors (not used
in this analysis)
13) MTBPM: Mean time between maintenance actions (not
used in this analysis)
14) MTTPM: Mean time to perform preventive maintenance
(not used in this analysis)
The output data consist of the following three items:
1) MUT: Mean up time
2) MRT: Mean repair time (which for the system being
analyzed will be the same as the mean down time (MDT))
3) Avail: Availability - Percent of time that system or
required elements are available for use.
[Figures 6.6, 6.7 and 6.8: FMP reliability analysis input data and
calculated results for the lower bound, probable and upper bound
cases.]
Unless otherwise noted, all times are expressed in hours. More
discussion on these terms can be found in Appendix D.
Examination of this data shows that the two major areas having
greatest potential impact on the FMP reliability are the connec-
tion network (CN) and the database memory (DBM). The connection
network, which for this analysis is assumed to have no redundancy,
has the lowest MUT for both the most probable case and the best
probable case. In the worst probable case the connection network
has the second lowest MUT, the lowest MUT being that of the data
base memory (DBM). Two factors contribute to the low MUT for the
DBM in this case; the failure rate of 6.4 FPMH, and a SECDED
improvement factor of only 2.
Conclusions from this analysis indicated that redundancy should be
implemented for the connection network. Furthermore, special
attention should be paid to the design and application of SECDED
to the data base memory and in obtaining LSI memory circuits with
a failure rate significantly less than 6.4 FPMH.
Table 6.6 summarizes the reliability analysis results for the FMP
and shows the values of the factors considered in the different
cases.
6.2.5 Support Processor and File Management Subsystems
Figures 6.9 and 6.10 show the reliability block diagrams of the
support processor and file management subsystems. The high level
of redundancy in these subsystems contributes significantly to
their overall reliability. For the purpose of this analysis, the
failure rates of the individual models include hard and
intermittent failures. The failure rate data used for the support
processor elements and for the disk packs and file control
elements of the file management subsystem are obtained from
current field experience on similar systems. The equipment that
might be used in the 1980's, though faster and of greater capacity
than that in the field now, is expected to have reliabilities and
availabilities that will equal or exceed those of these current
systems.
Figure 6.11 lists the data used for the support processor
subsystem. Computed outputs for the mean up time (MUT), mean down
time (MDT), availability and the mean time between interruptions
for the individual items and the total support processor subsystem
are shown in Figure 6.12. The CONFIGURE program described in
Appendix D was used to generate these results. An example of
actual field data of a similar system is provided for comparison
in Figure 6.13.
6-38
Figure 6.10 File Management Subsystem Reliability/Availability
Block Diagram
[Block diagram. Legible labels: FILE CONTROLS; six DISK PACK
blocks; MASS MEMORY. Note on diagram: 3 OUT OF 6 DISK PACKS
REQUIRED WITH MASS MEMORY AS BACK UP.]
6-41
Table 6.7 lists the data used for the reliability/availability
analysis of the file management subsystem and the results of this
analysis. The MUT and MDT used for the mass memory are based on
design specifications.
Table 6.7
File Management Subsystem Reliability
Data and Analysis Results

ELEMENT          R     N     MUT (HRS)    MDT (HRS)    A
File Control     1           19,310       1.9
Disk Packs       3     6        250        .0
Mass Memory      1            5,246        .8
System Total     1           19,310       1.9          .9999

R = Required number of elements
N = Number Available
MUT = Mean Up Time
MDT = Mean Down Time
A = Availability
(Data from experience on similar equipment or design
specifications)
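
The relationship between the MUT, MDT and A columns can be sketched as follows. This is a minimal illustration, assuming the usual steady-state definition A = MUT / (MUT + MDT); the report's own computation is done by the CONFIGURE program of Appendix D and may differ in detail. The function names, the independence assumption in the k-out-of-n helper, and the 0.99 per-pack availability are hypothetical and are not taken from the report. The helper also ignores the mass memory backup.

```python
from math import comb


def availability(mut_hours: float, mdt_hours: float) -> float:
    """Steady-state availability: fraction of time an element is up."""
    return mut_hours / (mut_hours + mdt_hours)


def k_of_n_availability(k: int, n: int, a: float) -> float:
    """Probability that at least k of n independent, identical elements are up.

    Independence between elements is an assumption of this sketch.
    """
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


# System Total row of Table 6.7: MUT = 19,310 hours, MDT = 1.9 hours.
system_a = availability(19_310, 1.9)
print(f"System availability: {system_a:.4f}")

# 3-out-of-6 disk packs with a HYPOTHETICAL per-pack availability of 0.99.
disk_a = k_of_n_availability(3, 6, 0.99)
print(f"3-of-6 disk pack availability (hypothetical input): {disk_a:.8f}")
```

With the System Total inputs the first value rounds to the .9999 printed in the table.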
6.2.6 Maintenance
6.2.6.1 Maintenance Philosophy
Maintenance of the FMP should be based on a remove and
replace-with-spare philosophy at the lowest replaceable unit (LRU)
level as determined by the maintenance analysis. Repair of the
replaced failed items would be off-line using subassembly testers
available at the site. The FMP should be equipped with fault
detection circuits that, in conjunction with system confidence
checks and diagnostics, would provide indications of an existing
problem via a printout or status display. Errors can be logged
automatically giving appropriate file information for isolation of
failure(s). Upon detecting a fault, the maintenance personnel can
initiate the isolation action required (hardware/software/manual
diagnostics) to locate the fault to the malfunctioning subassembly
within an element or LRU. A replacement subassembly will be with-
drawn from spares and substituted in the FMP element. Before
restoring the element to an active status, a confidence check
would be performed to determine if the failure has been corrected.
The malfunctioning subassembly can then be forwarded to the appro-
priate repair facility (site, depot or factory). Upon repair at
the site, the LRU can be returned to the spares stock.
The remove and replace philosophy requires that adequate spares be
stocked on-site to preclude degradation of the FMP performance
parameters (MUT, MDT, Availability). The actual quantity and
types of spares required and the lead times should be determined
from their actual usage in the equipment and their individual
failure rates.
Preventive maintenance of the FMP consists of periodic testing of
the power supplies, checking of rotating memories and general
housecleaning. This effort would be minimal, and most of it can
be accomplished on-line.
6.2.6.2 Maintenance Plan
Upon detection of a failure the system diagnostic can be auto-
matically initiated to determine the malfunctioning element. The
system automatically reconfigures under program control replacing
the malfunctioning element if it is redundant. Maintenance diag-
nostics can be initiated to isolate the failure to the mal-
functioning subassembly for removal and replacement by a spare and
the process manually restarted if the failed element is not
redundant. The design approach being investigated would allow
removal/replacement of redundant modules with power-on. This
approach would tend to reduce the equipment downtime by allowing
more rapid access to the failed items.
Since SECDED is applied to the memories, most single-bit errors
will not cause any equipment failure. When the log shows that a
single bit is stuck, the system could be shut down when desired in
an orderly fashion for maintenance action. This feature would
provide a minimum loss of productive time. The information stored
in the log could then be processed on an as-called basis for
location of the failure or error. The system diagnostic would
utilize the following means for error detection and error
correction:
6-46
a) Processor Module
   • Parity check on Microprogram Memory
   • Reasonableness checks (See Appendix C for
     detailed list)
b) Data Base Memory
   • Error correction with logging of errors
     for detecting repeated faults
c) Connection Network
   • Error correcting codes as part of the data plus
     parity checks on address and instructions to
     memory
d) Coordinator
   • SECDED in the memory
   • Reasonableness checks (See Appendix C)
e) Memories
   • SECDED
f) Power Supplies
   • Detection of over voltage on input line will
     cause the FMP to automatically shut down to
     prevent damage to the equipment
   • Detection of voltage out of range on output
6.2.6.3 Personnel Support Requirements
Detection, isolation, repair and checkout of a failure in the NASF
System requires individuals with knowledge and experience of
digital processing equipment. These individuals should have a
thorough understanding of electronic principles, systems logic and
solid state component operation as applicable to high speed
digital data processing equipment. They should also have a
thorough understanding of electronic test equipment operation, and
reading schematics, logic, wiring diagrams and blueprints. Their
background should include, at a minimum, a high school education
and training in an advanced electronics digital data processing
and computer maintenance course. Maintenance personnel should
possess experience in the installation, repair, overhaul and
modification of high speed digital data processing systems and be
familiar with the test equipment applications associated with the
accomplishment of these tasks.
An analysis has been conducted to ascertain the level of manpower
required to provide for repair and maintenance of the FMP. The
results of this analysis show that, to have 95% confidence of
meeting the required actions within the times allocated, a minimum
of 13 maintenance personnel, each working 5 shifts per week, are
required.
Estimates of the personnel support (labor hours) requirements for
the NASF System provided in this section assume the type of main-
tenance personnel described above. The estimates are based on 95%
upper confidence bounds applied to element failure rates and the
weighted average repair time of a given subsystem. (An upper
confidence bound of 95% applied to the element failure rates means
that for 95% of the time the failures of a given element will be
within this bound.) The basic steps followed to determine these
estimates are:
a. Determine the average number of failures expected in
a 168 hour operational week for a given element in
a given subsystem.
b. Determine the expected number of failures at the
95% upper confidence bound for each element and the
corresponding subsystem total.
c. Determine the weighted average equipment repair time
for the given subsystem.
d. Determine the labor hours expected at a 95% confi-
dence level to be expended in performing corrective
maintenance (CM) (on-line-localization, isolation,
LRU removal and replacement).
e. Estimate the labor hours for performing preventive
maintenance (PM).
f. Estimate the labor hours for LRU repair off-line
(bench repair).
g. Estimate the total labor hours required per shift.
Steps a and b
Table 6.8 shows the average number of repair actions expected
weekly as computed for each equipment in the FMP, File and Support
Processor subsystems. The weekly (168 hours) period was chosen
because it best satisfies operational conditions. The smallest
value of n_j that satisfies the Poisson formula condition given in
equation 6.1 determines the maximum number of repair actions at
95% confidence for the jth element.

$$ e^{-m_j} \sum_{i=0}^{n_j} \frac{(m_j)^i}{i!} > 0.95 \tag{6.1} $$

where m_j = Qty(j) t / MTBF(j) = average number of repair actions
for the jth element in time t.
Table 6.8 shows the input values of Qty(j) and MTBF(j) for each of
the j elements and the calculated values m_j and n_j for t=168
hours. The subsystem total for the FMP subsystem shows an average
of about 11.5 repair actions per week, but at a 95% confidence
level there will be no more than 33 repair actions per week. The
corresponding values for the file and support processor subsystems
are about 12 and 37 repair actions per week for the average and
95% confidence bound respectively.
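
Steps a and b can be sketched as below. This is a minimal illustration of the Poisson condition of equation 6.1, not the program used for the report. The function names are hypothetical, and the example element (quantity 4, MTBF 1,000 hours) is invented for illustration because no Table 6.8 values are quoted here.

```python
from math import exp


def expected_repairs(qty: int, mtbf_hours: float, t_hours: float = 168.0) -> float:
    """Step a: m_j = Qty(j) * t / MTBF(j), the average repair actions in time t."""
    return qty * t_hours / mtbf_hours


def repairs_at_confidence(m: float, confidence: float = 0.95) -> int:
    """Step b: smallest n with exp(-m) * sum_{i=0..n} m**i / i! > confidence."""
    if not 0.0 <= m < 700.0:
        # exp(-m) underflows for very large m; this sketch does not handle it.
        raise ValueError("m must be in [0, 700)")
    term = exp(-m)        # Poisson probability of exactly 0 repair actions
    cumulative = term
    n = 0
    while cumulative <= confidence:
        n += 1
        term *= m / n     # probability of exactly n repair actions
        cumulative += term
    return n


# HYPOTHETICAL element: 4 units, each with an MTBF of 1,000 hours.
m_j = expected_repairs(qty=4, mtbf_hours=1000.0)
n_j = repairs_at_confidence(m_j)
print(f"m_j = {m_j:.3f} repair actions/week, n_j = {n_j}")
```

The subsystem figures quoted above (33 and 37) are consistent with summing the per-element n_j values rather than applying the bound once to the subsystem average, which is why the helper works on one element at a time.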
6-48
Step c
Determine the weighted average equipment repair time for the given
subsystem.
The mean time to repair for each of the j elements is tabulated in
Table 6.8 as MTTR(j). The mean time to repair a subsystem, MTTR,
is obtained as a weighted average of the MTTR(j). The weighting is
done using the quantity Qty(j), and inversely as the mean time
between failures, MTBF(j), of the jth element, since these deter-
mine the frequency with which the jth element comes up for
repair. The appropriate formula for the subsystem mean time to
repair, MTTR, for corrective maintenance is:

$$ MTTR = \frac{\sum_j \dfrac{MTTR(j)\, Qty(j)}{MTBF(j)}}{\sum_j \dfrac{Qty(j)}{MTBF(j)}} \tag{6.2} $$

When the values in Table 6.8 are used for j=1 to 10, the weighted
average equipment repair time (for corrective maintenance) for the
FMP is 0.618 hours, and for j=11 to 20, the mean time to repair
for the file and support processor is 2.2 hours.
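
The weighting in equation 6.2 can be illustrated with a short sketch. The function name and the two-element data set are hypothetical; they are chosen only to make the weighting visible and do not come from Table 6.8.

```python
def subsystem_mttr(elements):
    """Weighted mean time to repair per equation 6.2.

    elements: sequence of (qty, mtbf_hours, mttr_hours) tuples.
    Each element is weighted by Qty(j) / MTBF(j), its expected repair frequency.
    """
    weighted_repair_time = sum(mttr * qty / mtbf for qty, mtbf, mttr in elements)
    total_repair_rate = sum(qty / mtbf for qty, mtbf, _mttr in elements)
    return weighted_repair_time / total_repair_rate


# HYPOTHETICAL elements: both come up for repair equally often
# (2/100 = 1/50 failures per hour), so the result is the plain average.
example = [
    (2, 100.0, 1.0),   # qty 2, MTBF 100 h, MTTR 1.0 h
    (1, 50.0, 2.0),    # qty 1, MTBF 50 h,  MTTR 2.0 h
]
print(f"Subsystem MTTR = {subsystem_mttr(example):.3f} hours")
```

An element that fails rarely contributes little to the subsystem MTTR even if its own repair time is long, which is the intent of dividing by MTBF(j).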
Steps d thru g
A good approximation to the distribution of mean repair time is a
normal distribution. It then follows that the general equation
for determining the manpower (personnel hours), P_CM, for
corrective maintenance expected to be expended at 95% confidence
can be expressed as:

$$ P_{CM} = \left( MTTR + \frac{1.645\,\sigma}{\sqrt{n}} \right) n P \tag{6.3} $$

where
P = Number of maintenance personnel
n = Number of repair actions
σ = Standard deviation of repair times
  = 0.25 hours (as determined from observed data
    taken on similar equipment)
6-49
Table 6.9
CORRECTIVE MAINTENANCE LABOR HOUR ESTIMATES

                    No. of Main-    No. of         Labor Estimates at 95%
                    tenance Per-    Repair         (Maintenance Personnel
Subsystem           sonnel (P)      Actions/wk (n) Hours/Wk) (P_CM)
FMP                 1                9               6.7949
                    2               16              23.0630
                    3                8              18.3193
                    Subtotal:       33              48.1772
Support Pro-        1               10              23.3019
cessor and File     2               19              88.6648
Systems             3                8              56.2929
                    Subtotal:       37             168.2596
Experience shows that 27.9% of all equipment failure corrective
action is performed with one (1) maintenance person, 49.6% with
two (2) maintenance personnel, and 22.5% with three (3) mainten-
ance personnel. Substitution of these values for P into equation
6.3 yields the results in Table 6.9.
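
The FMP rows of Table 6.9 can be checked with a short sketch of equation 6.3. It assumes that the allowance added to MTTR is the one-sided 95% normal term 1.645σ/√n. That reading is an inference: the factor 1.645 is not stated in the text, but with MTTR = 0.618 hours and σ = 0.25 hours it reproduces the three FMP entries to within rounding. The function and variable names are hypothetical.

```python
from math import sqrt

Z_95 = 1.645  # one-sided 95% normal quantile; its use here is an ASSUMPTION


def corrective_labor_hours(mttr: float, sigma: float, n: int, p: int,
                           z: float = Z_95) -> float:
    """Personnel hours at 95% confidence: (MTTR + z*sigma/sqrt(n)) * n * P."""
    return (mttr + z * sigma / sqrt(n)) * n * p


FMP_MTTR = 0.618   # hours, from the text
SIGMA = 0.25       # hours, from the text

# (P, n, value printed in Table 6.9) for the FMP subsystem.
table_6_9_fmp = [(1, 9, 6.7949), (2, 16, 23.0630), (3, 8, 18.3193)]

for p, n, printed in table_6_9_fmp:
    computed = corrective_labor_hours(FMP_MTTR, SIGMA, n, p)
    print(f"P={p} n={n:2d} computed={computed:8.4f} printed={printed:8.4f}")
```

Under the same reading the file and support processor rows for P = 1 and P = 3 also agree with the table (MTTR = 2.2 hours), while the P = 2 row computes to about 87.19 against the printed 88.6648, so the sketch should be treated as an approximation of the report's procedure and not a confirmed reconstruction of it.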
The results shown in Table 6.10 are based on the following assump-
tions: (1) Previous field experience for repair off-line
utilizing a subassembly tester indicates a two-hour repair time
per equipment failure. (2) The amount of time selected for
preventive maintenance (PM) is also based on previous field
experience. However, it is to be noted that the final time values
for PM can be better determined once the PM procedures are
developed.
The final results are adjusted to consider the efficiency of the
personnel. An 80% personnel efficiency is assumed to cover con-
tingencies such as set-up times, breaks, report writing and other
documentation requirements, etc. These results indicate that,
with a 95% confidence, thirteen (13) maintenance personnel can
adequately support the NASF Computing System working 5 shifts each
during a 21 shift, seven day week, or an average of about 3
persons per shift.
6-51
Additional personnel should be considered to account for time off
and shift rotation within established personnel policies. The
above manning level does not include those personnel required for
supervision, administration, software support, system operation
and maintenance of the data communication displays, terminals and
other I/O equipment.
TABLE 6.10
ESTIMATED NASF MAINTENANCE LABOR REQUIREMENT

                 Maintenance         Labor Required
Subsystem        Activity            (Maint. Personnel Hours)
FMP              CM                   48.18
                 PM                   14.00
                 Repair Off-Line      66.00
                 Subtotal:           128.18
File and         CM                  168.26
Support          PM                   28.00
Processor        Repair Off-Line      74.00
                 Subtotal:           270.26
TOTAL                                398.44
At 80% Efficiency                    498.05 hours/week
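
The arithmetic leading from Table 6.10 to the figure of thirteen maintenance personnel can be sketched as follows. The 8-hour shift length is an assumption, inferred from 21 shifts in a 168-hour week; the variable names are hypothetical. Summing the file and support processor entries gives 270.26, which is the subtotal consistent with the 398.44 total.

```python
from math import ceil

# Labor hours per week from Table 6.10.
labor_hours = {
    "FMP": {"CM": 48.18, "PM": 14.00, "Repair Off-Line": 66.00},
    "File and Support Processor": {"CM": 168.26, "PM": 28.00,
                                   "Repair Off-Line": 74.00},
}

EFFICIENCY = 0.80          # personnel efficiency, from the text
SHIFTS_PER_PERSON = 5      # shifts worked per person per week, from the text
SHIFTS_PER_WEEK = 21       # three shifts a day, seven days, from the text
HOURS_PER_SHIFT = 8        # ASSUMPTION: 168 hours / 21 shifts

subtotals = {name: sum(hours.values()) for name, hours in labor_hours.items()}
total_hours = sum(subtotals.values())
required_hours = total_hours / EFFICIENCY
staff = ceil(required_hours / (SHIFTS_PER_PERSON * HOURS_PER_SHIFT))
per_shift = staff * SHIFTS_PER_PERSON / SHIFTS_PER_WEEK

for name, value in subtotals.items():
    print(f"{name} subtotal: {value:.2f}")
print(f"Total: {total_hours:.2f}  at 80% efficiency: {required_hours:.2f}")
print(f"Personnel required: {staff}  average per shift: {per_shift:.1f}")
```

The two hours of off-line repair per failure also tie the tables together: 33 and 37 repair actions per week give the 66.00 and 74.00 hour entries.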
6.2.6.4 Sparing Considerations
An important condition to the acquisition and maintenance of any
system is the philosophy of sparing parts, assemblies, and sub-
systems to support the specified system operational requirements.
Sparing considerations are developed as a result of an overall
logistics support study which takes into account requirements such
as:
6-52
- System MUT, MTTR and Availability,
- Redundancy considerations,
- Recovery time of repairables,
- System maintenance philosophy,
- Hardware complexity,
- Corrective/Preventive maintenance skill requirements,
- Site, depot or factory repair,
- Special and standard test equipment or tools
  required at the site, depot or factory locations,
- Storage facilities (space, environment, etc.),
- Distance from spare part supply points,
- Turn-around time for repair on site, depot and factory
  ("Pipeline" time),
- Packaging for long term storage,
- Shelf life,
- Long term availability of discrete parts due to technology
  advances, etc.,
- Cost tradeoff studies of repair at piece part versus
  assembly/subsystem level or throwaway,
- Identification of wear out items replaced at specific
  intervals.
The maintainability characteristics of any system backed up by the
reliability, availability and performance requirements determine
the system effectiveness, logistics supportability and the cost of
system maintenance.
As new systems are developed, they become more complex with re-
spect to the sophistication of new state-of-the-art circuitry and
the application and density of circuitry within equipment ele-
ments.
Complex and large systems generally have inherently low mean up
times; therefore, a viable logistics support plan becomes a prime
factor in the operation of such systems.
As various elements of the NASF system become defined, final part
types, part quantities and categories ultimately selected and
circuit packaging determined, a realistic and comprehensive logis-
tic support/spares study can be performed on the FMP support
equipment.
Spares are determined through a quantitative analysis which basic-
ally utilizes item failure rates and item population in the system
and applies various confidence levels to meet the needs of the
customer requirements and the established qualitative and quanti-
tative maintenance requirements. Burroughs maintains computerized
programs for establishing spare part support quantities at the
desired confidence level. These programs can be used to establish
the spare quantities for the FMP once the subsystem becomes better
defined.
6-53
Spare parts selected for site maintenance consideration fall into
three basic categories, namely, electronic and mechanical piece
parts, subassemblies, and assemblies classified as site repairable
in accordance with the established maintenance philosophy.
Typical part types in these three categories are:
Piece Parts:    Fuses                  Connectors
                Integrated Circuits    Pin & Socket contacts
                Diodes                 Indicator lamps & LEDs
                Resistors              Blowers
                Capacitors             Drive Belts
                Switches               Motors
                CRT's                  Misc. parts (wire, etc.)
Subassemblies:  Individual Processors
                Printed Circuit Cards (logic)
                Power Supply regulator cards
                Miscellaneous Printer subassemblies
                Misc. Tape, Disc & Display subassemblies
Assemblies:     Power supplies (main)
                Keyboards
                Miscellaneous Printer assemblies
                Misc. Tape, Disc & Display assemblies
Spare parts required for depot or factory repair support will be
dependent on specific items identified through future support
efforts. In addition to sparing parts required for consumption at
the depot or factory level, additional items will be required to
maintain site levels of high value and/or non-reparable items for
maintaining the "pipeline" flow to the site.
6-54
CHAPTER 7
FMP TIMING SIMULATIONS
7.1 FMP MODEL
The FMP Model (Figure 7.1) includes Extended Memory (EM), Connec-
tion Network (CN), Coordinator (CR), and one or more processors,
each including one Execution Unit (EU) and two memory modules.
Synchronizing signals between CR and EU's are modeled, as are the
effects of CN and EM characteristics on EU instruction times.
The time resolution of the simulation is a single processor clock,
nominally 40 ns for a 25 MHz clock.
The simulation is detailed to the single processor and coordinator
(CR) instruction with sufficient accuracy in the models of the
various functions to give good estimates of the execution times of
actual code samples. The detail required is greatest in the
processor, where it nearly equals that to which the design has
been carried. The coordinator (CR) is modeled less completely,
but well enough to model instruction-level execution of code with
reasonable accuracy.
The EM and CN are modeled only to the extent that their perform-
ance parameters are accounted for in the timing of the instruc-
tions which use them.
7.1.1 Processor Model
Figure 7.2 shows the functions modeled in the processor. The way
these functions perform is best shown by tracing the steps in
executing instructions.
It is necessary in some cases to distinguish between functions
performed by the simulator, which use no simulation time nor
resources, and are indicated by (S), and the functions which take
time and/or resources and which correspond to actual hardware
functions, indicated by (M).
The simulated code file (S) contains one entry for each instruc-
tion of the actual code modeled. The PCR (S) points to the next
entry of the simulated code, and this entry when fetched (S)
points to a coded description of the instruction (S). This des-
cription is fetched and decoded as soon as the previous instruc-
tion starts executing. The coded information includes:
(1) Instruction length (code space taken)
(2) CU synchronizing action, if any
(3) Resources used (IP, FP, DM, CN buffer/CN)
(4) Time of use of each of the resources
(5) Reporting information if a floating point arithmetic
    instruction.
7-1
Figure 7.1 Flow Model Processor,
Showing Functions Included in Simulation Model
[Block diagram. Legible labels: COORDINATOR UNIT (CR, CRM);
CONTROL, DATA and SYNC lines; PROCESSOR ARRAY with CNB, IP, FP,
PM and DM in each processor; EXTENDED MEMORY; CONNECTION NETWORK.
Legend: CNB = CN buffer, IP = integer processor, FP = floating
processor, PM = program memory, DM = data memory.]
7-2
Figure 7.2
Functions Simulated in Processor Model
[Block diagram. Legible labels: connections to CU (GO, IGH + EN,
ENABLE/DISABLE, FORCE BRANCH); I GOT HERE (IGH); ENABLE;
SCOREBOARD; PROGRAM STACK; HOLDING REGISTER; CN BUFFER (CNB) with
DATA BUFFER; FLOATING PROCESSOR (FP); INTEGER PROCESSOR (IP);
CODE FILE (PM); DATA MEMORY (DM); EXECUTION UNIT; PROCESSOR
MEMORY, homogeneous or separate.]
After fetching and decoding (S) the instruction, the actual
behavior of the processor in fetching, decoding, and executing the
instruction is modeled.
7.1.2 Program Fetch
The processor memory is modeled as static memory, with three
clocks access time, that is, new data is available at the output
three clocks after a new address is supplied, and remains avail-
able statically so long as the address is held. The actual pro-
gram address register is not modeled, but it is assumed that the
next program address is supplied as soon as the previous program
word is fetched to the program stack, so the next program word is
available three clocks later. The initiation of program fetches
is therefore driven by the availability of space in program stack
(M). The space that was occupied in program stack by an instruc-
tion (M) (as specified in its description) becomes available as
soon as it starts executing, and when the total space available in
program stack is enough, a code word (M) is transferred to it, and
the next program fetch is initiated.
The above program fetch sequence is subject to two exceptions.
When a jump or a conditional branch is executed, the program stack
is marked empty, and the program fetch then in progress is restart-
ed, so that the new program word is not available for execution
for three clocks. Furthermore, this action itself does not begin
until the branch instruction has been executed. The latter also
applies for a test-and-branch instruction when the branch is not
taken; the next instruction cannot start executing until the test
is completed. Alternatively, the model may be altered so that
program fetch delay occurs when the branch is not taken, with
program execution from the new address continuing without delay
when the branch is taken.
The second exception case for program fetch can occur only when
the model of processor memory is made homogeneous, that is, both
modules are shared between program and data storage. In this case
a data access to one of the modules aborts the program access then
in progress, and the next program word from that module is not
available for two memory cycles (6 clocks). If the data access
and the transfer to program stack are simultaneous, the transfer
is completed without delay, so the maximum program fetch delay
which can be caused by a single data access is five clocks. The
module for data access is selected at random (S). When memory is
used for data fetch or store, it is treated as a single resource,
not accessible when busy, even in the case that both memory
modules may be used for data storage and are independently access-
ible. This is because the memory addresses are always modified by
an integer register, so the actual address and thus the module to
be used cannot be known until after the instruction starts
execution.
When memory is modeled as homogeneous, program fetches alternate
between the two modules, but only a single program address
register is assumed, so the program fetches from the two modules
are initiated simultaneously, and the next fetch from the first
module cannot be initiated until the fetch from the second module
is complete.
7.1.3 Instruction Execution
After the instruction is decoded (S), and the resources needed and
the times when their use starts have been determined, the score-
board (M) is examined to determine whether the instruction must be
delayed. The scoreboard, updated when each instruction starts
executing, contains the time at which each resource will be releas-
ed. If it is found that there will be a delay, further (S) proces-
sing waits until the resources are available. Then the content of
Program Stack is examined, and if the operator syllable(s) are not
present, the instruction queues (S) until the next program fetch
makes the syllables available. When the instruction starts, if
its use of any of the resources is specified as delayed, then that
part of the instruction must wait in the proper Holding Register
(M). If the required Holding Register is in use by the prior
instruction, then the start of execution is delayed until the
holding register is available.
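
The scoreboard check described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the simulator itself: it treats the scoreboard as a table of release times, delays an instruction until every resource it needs will be free at the moment its use begins, and lets at most one instruction start per clock. Program stack waits and holding-register conflicts are left out. The class name, function name, instruction names and clock counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    name: str
    # resource name -> (clocks after start at which use begins, clocks held)
    uses: dict


def schedule(instructions):
    """Return (name, start clock) pairs for in-order issue with a scoreboard."""
    release = {}      # scoreboard: resource -> clock at which it is released
    next_start = 0    # earliest clock at which the next instruction may start
    starts = []
    for ins in instructions:
        start = next_start
        # Delay until each resource is free at the time this instruction needs it.
        for resource, (offset, _held) in ins.uses.items():
            start = max(start, release.get(resource, 0) - offset)
        # The scoreboard is updated when the instruction starts executing.
        for resource, (offset, held) in ins.uses.items():
            release[resource] = start + offset + held
        starts.append((ins.name, start))
        next_start = start + 1   # ASSUMPTION: one instruction start per clock
    return starts


# HYPOTHETICAL three-instruction sequence and timings.
program = [
    Instruction("FADD", {"FP": (0, 5)}),
    Instruction("IADD", {"IP": (0, 1)}),
    Instruction("FMUL", {"FP": (0, 5)}),
]
starts = schedule(program)
print(starts)
```

In this sequence the second floating point instruction waits for the FP unit while the integer instruction between the two proceeds immediately, which is the kind of overlap the scoreboard is meant to permit.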
Note the reversal of the actual order of operations: The instruc-
tion cannot actually be decoded until it has been fetched, and any
waiting for resources must follow this. However, we wish to know
how much the execution of code is delayed by program fetch, so we
do not count any fetch time during which the instruction would
have been waiting for resources anyway.
The actual processor probably would not use a scoreboard as above,
because this mechanism for controlling instruction overlap is not
fully effective when the execution time for instructions is data
dependent, as will probably be the case for arithmetic operations.
A mechanism similar to the holding registers would be used, where
the various parts of the instruction can wait for their resources.
An example of the difference in timing in the two cases is shown
in the timing charts of Figure 7.3. Here the instructions using
the Integer Processor (Numbers 2, 3, 5) start much sooner in (b)
than in our model (a). This is because when an instruction enters
the queue, the next instruction is available for decoding, whereas
in (a), when an instruction is delayed by the scoreboard, the de-
coding hardware is tied up until the instruction starts. However,
note that in both examples the Floating Processor is busy full
time, and the FP instruction (4) starts at the same time in both.
It is reasonably clear that the queueing mechanism will allow more
instruction overlap in some sequences, giving a reduction in
execution time, but such cases will be uncommon and only a small
reduction in total execution time can be expected.
7-5
Figure 7.3
Simple Instruction Timing Diagram,
Contrasting Scoreboard and Queueing Implementations
of Instruction Overlap
[Timing chart. Vertical axis: INSTRUCTION; horizontal axis: TIME
in clocks (ticks at 5, 10, 15). Panels: (a) Using Scoreboard;
(b) With Queueing for Resources. Bar labels: FP, delay, queue.]
7-6
The IP, FP, DM and CNB resources are modeled, and utilization of
these resources is reported. Program memory (PM) is modeled as
two separate resources when the memory is homogeneous, but is
shown to be utilized only during the memory cycle time actually
used for access. That is, neither cycles aborted by data access
to that memory, nor the time spent holding output while waiting
for space in the program stack are counted as program memory
utilization time.
A running count (S) is maintained of the number of instructions
executing, this count being updated whenever an instruction starts
or ends. Special resources (S) are used to report the
percentage of the simulated execution time that 1, 2, or 3
instructions are executing concurrently.
7.1.4 Synchronizing Action
The state of synchronization of the processor is described by the
state of two flipflops (M): I GOT HERE (IGH), which is set by
certain instructions, and ENabled (EN), which is set whenever the
processor is enabled, and when reset causes the processor to stop
executing before the next instruction. A logic level formed by
the logic combination (IGH OR NOT EN), ANDed with the same logic level
from all other processors is transmitted to the coordinator (CR).
The IGH is reset when a GO pulse is received from the CR and EN is
set by an ENABLE pulse from CR (M). Cable delays for these
signals are zero from CR to EU, and three clocks from EU to CR,
because the system clock is assumed to be in CR, so that signals
from CR travel with, and arrive at the same time as the
corresponding clock.
The uses of synchronizing action are as follows: Certain instruc-
tions (FETCHEM, BDCST, HVST, SHIFCN) involve exchange of data
through the CN, under overall control of the CN by CR, with clock-
ed, synchronized data transmission. Therefore, all processors
must be at the proper point in the instruction (or disabled)
before the data transmission can begin. Such instructions set IGH
during execution and then wait for GO from CR before completing in
synchronism across the array. Since all such instructions use the
CN Buffer (CNB) unit, and the data exchange is via the data buffer
internal to CNB, the processor can continue executing succeeding
instructions as soon as IGH is set, provided that none of them use
CNB or set IGH. Certain other instructions (WAIT) set IGH and
then wait for GO before starting the next instruction. Obviously,
such an instruction cannot start if IGH is already set, but must
wait for the GO to reset IGH before going on. The STOP instruc-
tion resets EN and then stops instruction execution until the
ENABLE from CR again sets EN.
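
The synchronizing logic can be sketched as follows, assuming the per-processor level is IGH OR NOT EN as described above, so that a disabled processor never holds up the array. Cable delays are ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessorSync:
    igh: bool = False   # I GOT HERE flipflop
    en: bool = True     # ENabled flipflop


def array_ready(processors) -> bool:
    """Level seen by the coordinator: AND over all processors of (IGH or not EN)."""
    return all(p.igh or not p.en for p in processors)


def go(processors) -> None:
    """GO pulse from the coordinator resets IGH in every processor."""
    for p in processors:
        p.igh = False


def enable(processors) -> None:
    """ENABLE pulse from the coordinator sets EN in every processor."""
    for p in processors:
        p.en = True


# Three processors: one at the sync point, one disabled (STOP), one still running.
array = [ProcessorSync(igh=True), ProcessorSync(en=False), ProcessorSync()]
print(array_ready(array))   # the running processor has not set IGH yet
array[2].igh = True
print(array_ready(array))   # every enabled processor has now set IGH
go(array)
```

After the GO pulse the enabled processors have IGH reset and the level drops again until the next synchronizing instruction.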
7-7
7-8
7.1.5 External Access Model
The CN buffer (CNB) unit contains registers for the Extended
Memory address, which are loaded from Integer Registers, and a
Data Buffer to hold data which is to be transmitted through the CN
or which is received from it. The data buffer in CNB may be empty
or full, and in either case may be busy or available. However, in
our model, the buffer is always available when CNB is available,
and is then either FULL or EMPTY depending on the last CNB instruc-
tion executed. The CNB functions are designed so that those which
transmit data through CN (STOREM, HVST, SHIFTN) specify the pro-
cessor source for the data (DM, FP register, one or more IP regis-
ters), and so appear in several versions. However, those instruc-
tions which receive data from CN (LOADEM, FETCHEM, BDCST, SHIFTN)
do not specify a processor destination, but leave the data in the
FULL data buffer in CNB, from which it is transferred by a second
instruction (-REM) which specifies the processor destination. The
reason for this is that there may be appreciable delay in using
the CN, either by conflicts in CN or at EM for LOADEM, or by
waiting for other processors in the synchronized CNB instructions.
In either case, the separation of the CNB action and the transfer
to processor destination allows the compiler to save execution
time and mask the delay time by inserting the CNB instruction as
early as possible, followed by as many other instructions as pos-
sible before calling for the data.
A CNB instruction (or -REM, which also requires CNB to be avail-
able) cannot start while CNB is still busy with a prior CNB
instruction. In our model, succeeding instructions, therefore,
must also be delayed, although in a queueing model, having a
register for queueing CNB instructions, succeeding instructions
not requiring any of the same resources might continue execution.
7.1.6 Branching
We model only the execution time of instructions, and the model
does not "know" what they do, nor do the modeled instructions
contain any data or addresses. Therefore, branching must be con-
trolled by special code words in the simulated code file which do
not model any actual instructions but contain coded branch control
information to be used by the simulator. Such words can be insert-
ed anywhere, and their processing does not take any time nor
resources. However, every time a branch is executed, the simulat-
ed code addresses are reported, in order to allow tracing the
execution of the first simulated code. The branch controls oper-
ate as follows: The first time a branch control is executed, a
processor subroutine is initiated to process it. The code word
contains algorithms to specify:
(a) The branch address, which may be dependent on processor
number if desired,
(b) The repeat number, N, which may be dependent on processor
number, and which may be calculated from a probabilistic
algorithm if desired,
(c) The kind of branch action desired. Program control may
either drop through N times and then branch or may
branch N times and then drop through. In either case,
when the branch control has been executed N+1 times, the
simulator subroutine terminates, and the next time the
control word is executed the process is reinitiated.
(d) Special CALL/RETURN constructs are available for sub-
routine calls.
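The branch control behavior described in (b) and (c) can be sketched as a small counter object. This is a minimal illustration only: the class and parameter names are hypothetical, N is taken as fixed, and the dependence on processor number, the probabilistic repeat count, and the CALL/RETURN constructs are omitted.

```python
class BranchControl:
    """Sketch of one branch control word with a fixed repeat number N."""

    def __init__(self, repeat, branch_first):
        self.repeat = repeat              # the repeat number N
        self.branch_first = branch_first  # True: branch N times, then drop through
        self.count = 0                    # executions since (re)initiation

    def execute(self):
        """Return True if control branches on this execution."""
        self.count += 1
        in_first_n = self.count <= self.repeat
        taken = in_first_n if self.branch_first else not in_first_n
        if self.count == self.repeat + 1:
            self.count = 0  # terminated; the next execution reinitiates
        return taken


if __name__ == "__main__":
    loop_end = BranchControl(repeat=3, branch_first=True)
    print([loop_end.execute() for _ in range(8)])
```

With `branch_first=True` the object behaves like the closing branch of a loop executed N+1 times; with `branch_first=False` it behaves like an exit test that fails N times before being taken.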
7.1.7 Coordinator
The coordinator (CR) has its own instruction set and simulated
code file. However, the CR is modeled in less detail than the EU,
so the simulated instruction description requires only one coded
word containing six parameters: Three kinds of synchronizing
actions, size of instruction, CR memory action, and CR processor
time. The processor is modeled as a single unit, so no instruc-
tion overlap is modeled, except for memory access portions. The
CR memory is modeled as a single module used for both program and
data. The interaction is simplified by assuming that program
fetch is initiated only when there is space for the new word in
program stack (two words capacity), and once initiated a program
fetch is not interrupted by data access. There is usually little
contention for memory, since data access in CR tends to be infre-
quent.
Branch control in CR is implemented in essentially the same way as
in the EU, except, of course, there can be no dependence on proces-
sor number.
7.1.8 Extended Memory Access
The new Connection Network is used asynchronously, which has
important advantages over the synchronous TN when the pattern of
EM access is different in different EU's because of data dependent
branching, or when the pattern is not a P-ordered vector. How-
ever, internal conflicts in the CN, or multiple requests to the
same EM module can cause some accesses to be delayed. The delays
are determined by the actual pattern and timing of accesses across
the entire array. However, within the framework of this simulator
it is impossible to model these delays exactly because:
(1) We model only a few processors (usually one), not 512.
(2) We model only the execution time, not the actual data
processed, so the access patterns are not modeled.
(3) The simulation model for the CN is complex, so that it is
impractical to incorporate it into the FMP model.
Therefore, the CN delay is modeled as a probability distribution.
The nominal distribution is exponential, with five clocks expected
value. Separate simulations (see Appendix B) of random accesses
to Extended Memory through the Connection Network under maximum
possible load conditions indicate an average access delay of about
one CN clock (three processor clocks). Since the CN simulator has
not been run long enough in any test case to reach steady state,
we assume that the distribution of delays may have a longer tail
than the exponential, so we approximate this worst case by an
exponential distribution with four clocks expected value (and four
clocks standard deviation). The standard deviation of the average
delay for 100 accesses is therefore 0.4 clocks, so that the worst
case out of 512 processors would be expected to have an average
delay 2.9 standard deviations greater, or 5.2 clocks.
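The arithmetic behind the 0.4-clock and 5.2-clock figures can be checked with a short sketch. The report does not state how the factor of 2.9 was obtained; reading it as the normal quantile that one processor in 512 is expected to exceed is an assumption made here because it reproduces the quoted value. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def worst_case_mean_delay(mean=4.0, accesses=100, processors=512):
    """Return (sd of the average delay, z factor, worst-case average delay)."""
    sd_single = mean  # an exponential distribution has sd equal to its mean
    sd_of_mean = sd_single / math.sqrt(accesses)
    # Assumed reading: quantile exceeded by about one processor out of 512.
    z = NormalDist().inv_cdf(1.0 - 1.0 / processors)
    return sd_of_mean, z, mean + z * sd_of_mean


if __name__ == "__main__":
    sd, z, worst = worst_case_mean_delay()
    print(f"sd of average = {sd:.2f} clocks, z = {z:.2f}, worst case = {worst:.2f} clocks")
```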
Extended-memory-access instructions are thus modeled with the CN
delay added to normal execution time of the instruction. The
decoding of the next instruction and its overlapping execution (if
possible) is not delayed. The execution of Processor code is
delayed by instructions which use the CN buffer only if the CNB is
still busy with the last such instruction, since the EM accesses
are managed by CNB without interfering with the use of any other
processor resources. Delay is minimized by ordering the code so
as to interpose other instructions between CNB uses. In parti-
cular, the -REM type instruction which uses the data fetched to
the Data Buffer in CNB by an EM access, is placed as late as
possible in the instruction stream. By these means, the code (FX
subroutine) which suffered the largest delay from waiting for CNB
was delayed only 11.4% (see Sec. 7.2.6.2, and Table 7.2). In this
case, reducing the contention delay discussed above from five
clocks to ½ clock increased the throughput only 4.0%.
7.1.9 Simulation Results
The primary information provided by the simulation run is the
elapsed time required to run the simulated code, and the number of
floating arithmetic operations performed, which together give the
throughput in floating operations per second.
Much additional information is reported, such as the total exe-
cution time of the arithmetic part of floating point instructions,
delays caused by branching and program fetch, utilization of the
various processor and coordinator resources, and the extent of
instruction overlap achieved in the processor(s). This infor-
mation can be useful in understanding why the processing of the
code behaved as it did, and is useful in guiding the details of
hardware and software design.
7.2 SIMULATIONS PERFORMED
The code segments simulated were selected from the Hung-
MacCormack explicit and the 3-D implicit aero flow codes, and from
a GISS weather code. The criteria for code selection were, first,
to select a range of types of codes to cover a wide range of flop
throughput and of factors influencing the throughput, and second,
to include from each code samples typical of those portions which
account for the major portion of the execution time of the
program.
By comparing each block of code in an entire program with the code
segments actually simulated, it was then possible to estimate
throughput for that block, and by proper weighting, to estimate an
average throughput for the entire run of each program. These
estimates are probably on the low side because the parameters used
in the simulation model are conservative:
(a) The assumed 40-ns clock period is ample for ECL logic.
This allows safe, conservative logic design, and actual
detailed design may show that a slightly faster clock is
feasible.
(b) The execution time for arithmetic instructions is
assumed constant. If the instruction logic is designed
to give data-dependent execution time, the assumed
constant value is near the worst case, and the average
execution time will be considerably less.
(c) No great sophistication of the compiler is assumed,
either in the generation of efficient code or in
optimization of register allocation or instruction
reordering.
(d) The scoreboard method for controlling the overlap of
instruction execution is assumed. As discussed in
Section 7.1, this is less effective than the queueing
method which would probably be used.
(e) The use of the new Connection Network for access to
Extended Memory will produce some delays caused by
contention within the CN or at EM. These delays are
difficult to estimate accurately, so a conservative
estimate was used. The actual delays in real program
runs will probably be considerably less than the
simulated value.
7.2.1 Selected Codes
The selected code segments were TURBDA, AMATRX, and a portion of
BTRI from the implicit code, 5X and a portion of CHARAC from the
Explicit code, and parts of AVRX, COMP2, and COMP3 from the GISS
weather code. The throughput indicated by simulation of these
code samples ranges from 70 MFLOPS for AVRX to 1330 MFLOPS for
AMATRX. The simulation results are summarized in Table 7.1, which
includes additional information on utilization of processor re-
sources and delays in execution as reported by the simulator. A
detailed discussion of this table, and the throughput-controlling
features of the several codes follows. The FMP FORTRAN and assem-
bly-language versions of the codes as simulated are given in
Appendix G.
Some of the earlier simulations were performed with a model of the
Transposition Network, in which accesses to EM are synchronous
across the array, and controlled by the CU. Comparison of the CN
and TN performance indicates that the average contention delay
involved in using the CN is compensated by the fact that EM access
instructions through CN can be more completely overlapped because
of the buffering action of the associated CN Buffer unit in the
processor, and by the use of a much faster implementation of the
IMOD521 instruction in the later version of the model. The ear-
lier simulation results therefore remain essentially valid.
7.2.2 TURBDA
This code has four EM accesses (three LOADEM's and one STOREM) in
the inner loop, and 28 floating arithmetic operations, of which 18
are concentrated in an in-line Newton-Raphson square root. The
somewhat low throughput of 835 MFLOPS is accounted for by three
factors:
(a) The average execution time of 8.1 clocks per floating
arithmetic operation is about 10 percent longer than
average because of a somewhat higher than average pro-
portion of divides and multiplies.
(b) The integer operations form a higher than average pro-
portion, as shown by the IP use percentage and are not
overlapped as much as usual by floating operations.
(c) There are only seven floating operations per EM access.
7.2.3 AMATRX
This section of the implicit code is involved with generating the
local five by five matrices to be inverted by BTRI. Each
iteration of the inner loop performs 80 floating point operations
for five EM accesses (LOADEMs), and 53 local memory accesses.
This is a rather favorable case, as shown by the high (84 percent)
utilization of the FP unit in the processor. The fact that the EM
and local memory accesses are performed with no more than about 20
percent loss from the maximum theoretical throughput of 1680
MFLOPS for the per-FLOP time of 7.6 clocks indicates the
effectiveness of the instruction overlap.
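The 1680 MFLOPS ceiling quoted above is consistent with 512 processors each completing one floating operation every 7.6 clocks of 40 ns. The sketch below reproduces that figure under the assumption that this is how it was derived; the function and constant names are hypothetical.

```python
CLOCK_NS = 40.0    # clock period assumed in the simulation model
PROCESSORS = 512   # number of processors in the array


def peak_mflops(clocks_per_flop, processors=PROCESSORS, clock_ns=CLOCK_NS):
    """Array throughput if every processor did floating arithmetic continuously."""
    flops_per_second = processors / (clocks_per_flop * clock_ns * 1e-9)
    return flops_per_second / 1e6


if __name__ == "__main__":
    peak = peak_mflops(7.6)
    simulated = 1330.0  # AMATRX throughput from the simulation
    print(f"peak = {peak:.0f} MFLOPS, loss = {100 * (1 - simulated / peak):.0f} percent")
```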
7.2.4 BTRI
A representative portion of the BTRI subroutine was hand compiled
for this simulation. About 77 percent of the floating operations
are concentrated in an inner loop, which is a 5 by 5 nested DO
loop with only 10 floating operations and 7 indexed local memory
fetches in the inner loop. BTRI runs slower than might be expect-
ed for a subroutine with no EM accessing because:
(a) The indexing of the local arrays causes a large number
of integer operations which cannot be entirely
overlapped by the few FLOPs.
(b) Several of the integer operations are large (48-bit)
instructions, but with short execution time, so that
they use up code faster than it can be fetched. The
result is the indicated 14.2 percent of elapsed time
spent waiting for program fetch.
(c) The nested DO loop causes a large number of branches,
causing 3.3 percent of time to be spent waiting for
program fetch after branch, and further aggravating (b)
above because every branch causes any program look-ahead
which has been done to be wasted.
Note that some of these inefficiencies could be reduced by unwind-
ing the inner DO loop, which would involve repeating the same
brief code section five times. Some of the cost of indexing the
local arrays could be saved by reprogramming to store the N differ-
ent five by five matrices which are generated by AMATRX in a 25 by
N array instead of a five by five by N array. This, together with
the unwinding of the inner loop, would considerably increase the
throughput of BTRI. The loss of program look-ahead on branching
can be reduced by implementing fetch on no-branch instead of fetch
on branch.
7.2.5 GISS Climate Code Samples
Analysis of the weather/climate code throughput is discussed in
Section 3.4.4. The samples selected for simulation were as
follows.
7.2.5.1 AVRX
This routine smooths the data in the longitude direction for
latitudes near the poles in order to compensate for the too-close
spacing of these grid points. The number of iterations of the
smoothing algorithm therefore depends on latitude and must be
computed. The index computations are fairly complex, and the
smoothing algorithm itself is very simple, so there is less than
one floating arithmetic operation per EM access, and much integer
computation.
Furthermore, the organization into a DOALL in which 26 instances
are allocated to each processor in order to leave no processors
idle considerably increases the integer computation to be executed
in each instance, thus partially defeating the purpose.
The net result is a code sample which is in a sense a worst case
for the FMP with floating point throughput of only 70 MFLOPS.
7.2.5.2 COMP2
Portions of the CORIOLUS FORCE and VERTICAL ADVECTION code were
hand compiled and simulated. This code performs only about two
floating point operations per Extended Memory access, and the
addresses in two- or three-dimensional EM arrays are calculated
from the indices with no shortcuts, so that one or two double-
precision integer multiplies are required for each calculation.
The result is that integer arithmetic dominates the code, as shown
by the COMP2 entry in Table 7.1, where the integer processor is
busy 65 percent vs only 40 percent for the Floating Processor.
The floating arithmetic also is above average in clocks per
operation (10.3), and floating arithmetic is being executed only
30 percent of the elapsed time.
7.2.5.3 COMP3
The portion of COMP3 simulated was LINKHO having no EM accesses.
The throughput of 980 MFLOPS indicated by simulation is only 25 or
30 percent less than the practical maximum of about 1300 to 1400
MFLOPS attained when the processors are doing floating arithmetic
80 percent of the time. The COMP3 result is lower because of
three factors:
(a) The average floating arithmetic operation takes 9.1
clocks, compared with the nominal average of 7.3,
because of a higher proportion of multiplies and
divides.
Table 7.1 FMP Simulation Results

CODE               TURBDA    AMATRX    BTRI      AVRX      COMP2     COMP3
MFLOPS             840       1330      1200      70        380       980
CLOCKS/FLOP        8.1*      7.6       [illeg.]* 7.8       10.3      9.1*
FLOPS/EM access    6.8       16.0      ---       (note 1)  0.86      ---
Instr. Overlap     1.0       1.2       1.2       0.96      1.2       1.1

* Significant factors (circled in the original).
Note 1. See Appendix A for discussion of AVRX.
The Percent use rows (FL Arith, FP, IP, DM, PM) and the Percent Delays rows
(Branch, Prog. fetch, EM Access) are not legible in this copy.
(b) The FP arithmetic is being done only 70 percent of the
time, although the FP unit is busy 82 percent. This is
because of a number of non-arithmetic floating register
operations such as change sign, move, and local memory
access.
(c) Program branching causes delays amounting to nearly four
percent of the time.
7.2.6 3-D Explicit Aero Flow Code
In the FMP model used for the simulations shown in Table 7.2 and
reported below, the local memory is homogeneous: both modules are
shared between data and program. One test run with the processor
model having separate program and data memories shows about 5
percent lower throughput because of waiting for program fetch.
Circles are used in Table 7.2 to call attention to those items
which limited throughput for each simulation.
7.2.6.1 CHARAC
This third level subroutine from the explicit code has no EM
accesses, but contains many data dependent branches, and DO loops
whose iteration count varies because of data dependent exits. The
CHARAC code would therefore be very difficult to vectorize, but
presents no difficulty to the parallel machine, although of course
the tests cost time.
The section of CHARAC code (shown in Appendix H) which was simula-
ted consists of a DO loop on JC and a portion of the code follow-
ing the JC loop. The JC loop is preceded by several local memory
accesses to save local registers, and within the loop are 24 float-
ing point arithmetic operations. It is exited by the AND of two
comparisons of floating variables. The JC loop contains an inner
DO loop on JJ, which performs only integer operations, and is
exited by the AND of two comparisons of floating point variables.
Three simulations were performed, varying the JC and JJ counts, as
shown in Table 7.2:
(a) JC loop performed eight times, with JJ performed six
times in each. This gives the low throughput of 900
MFLOPS.
(b) JC loop performed 15 times, with JJ performed once in
each, giving throughput of 1180 MFLOPS.
(c) Same as (a), but with JJ loop reprogrammed in FORTRAN to
use only one rather than two comparisons of floating
variables to decide the exit from the loop, and with the
new JJ loop coded for maximum efficiency by hand, using
tricks a compiler might be smart enough to use. The
throughput of this version is 990 MFLOPS, or 10% more
than version (a), because of the 40% reduction in run-
ning time of the recoded JJ loop. If the JJ loop is
performed fewer times, the throughput will approach or
slightly exceed case (b).
Table 7.2
Summary of Simulations of EXPLICIT CODE

                   CHARAC    CHARAC    CHARAC
CODE               (a)       (b)       (c)       LX/FX     LX        FX        SQRT
MFLOPS             900       1180      990       570       530       590       1500
CLOCKS/FLOP        [illeg.]* [illeg.]* [illeg.]* 8.3*      [illeg.]* [illeg.]* 7.0
FLOPS/EM Access    ---       ---       ---       3.6       2.8       4.1       ---
Percent use, CN    ---       ---       ---       12        12        12
Instr. Overlap     1.07      [illeg.]  1.09      1.13      1.19      [illeg.]

* Significant Factors Affecting Throughput (circled in the original).
The remaining Percent use rows (FL Arith, FP, IP, DM, PM) and the Percent
Delays rows (Prog. Fetch, EM Access) are not legible in this copy.
Nearly half of the JJ loop time is accounted for by waiting for
program fetch, both after a branch, and when code is being exe-
cuted faster than it can be fetched. This is another case where
more efficient packing of code or faster access to program memory
would appreciably improve throughput.
It is interesting to note that in some algorithms the programmer
can use arithmetic comparisons and conditional branching to save
some arithmetic. In such cases the throughput measure of programs
would be more consistent if floating point comparisons were consi-
dered to be arithmetic operations; otherwise a program with super-
ior performance might be measured as having lower throughput. If
this measure were applied to the three cases of CHARAC discussed
above, they would become nearly equal at about 1300 MFLOPS.
Examination of Table 7.2 shows the following important factors
affecting throughput of the CHARAC sample.
(a) The average execution time of a floating arithmetic
operation is 7 to 8 percent higher than the nominal 7.4
clocks. This is because of a higher than average propor-
tion of divides.
(b) Waiting for program fetch, both after branching and in
other places accounts for 13 to 15 percent of the elap-
sed time in cases (a) and (c). The high utilization of
Program Memory is not responsible: most of both delays
occur in the JJ loop, which uses a good deal of program
space while requiring little execution time.
(c) The utilization of the Floating Processor is low in
cases (a) and (c), even allowing for program fetch
delays, indicating a good deal of non-overlapped integer
computation. Again, this is mostly in the JJ loop, as
indicated by case (b) where the JJ range of code is
executed only once, and the FP utilization is only about
12% below the values attained in AMATRX and COMP3.
7.2.6.2 LX/FX
The second level subroutine LX from the explicit aero-flow code
executes within a DOALL with J and K as domain variables and each
instance has inner DO loops on I, with IL or IL-2 iterations. The
third level subroutine FX is called in an inner loop that is per-
formed twice, so FX is called about twice IL times in each in-
stance of JK. The simulations were run with the IL and IL-2 loops
both performed 10 times, since the computer runs would have been
too long with values of 100 and 98. At ten iterations the code in
the loops dominates the running time, so there is little error in
this approximation. Separate simulations of LX, with FX calls
deleted, and of FX code (with no RETURN) were performed. For
interest's sake, the SQRT code which is present in-line in FX was
also timed by using the trace in the simulation output.
The results are shown in Table 7.2. The FX calls contribute about
70 percent of the FLOPS in LX/FX, and the tabulated figures for
LX/FX agree with the weighted average between LX and FX figures.
The LX/FX throughput of 570 MFLOPS is limited by the factors
circled in the table: (1) a mix of arithmetic instructions that
gives an average execution time of 8.3 clocks or about 12% more
than average, (2) a high usage (48%) of the integer processor,
(3) appreciable delay (4.5%) for program fetch after branch and
(4) 9.2% of the running time is spent waiting for extended memory
access.
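The agreement between the LX/FX figure and the weighted average of the LX and FX figures can be checked as follows. The sketch assumes the Table 7.2 throughputs for LX and FX are 530 and 590 MFLOPS, uses the approximate 70 percent FX share of the FLOPS quoted above, and combines the two parts by adding their running times; the function name is hypothetical.

```python
def combined_mflops(parts):
    """Combine code sections given as (share of total FLOPS, MFLOPS) pairs.

    Running times add, so the combined rate is the FLOP-weighted harmonic
    mean of the individual rates. The shares are expected to sum to 1.
    """
    time_per_flop = sum(share / rate for share, rate in parts)
    return 1.0 / time_per_flop


if __name__ == "__main__":
    lx_fx = combined_mflops([(0.70, 590.0), (0.30, 530.0)])
    print(f"LX/FX estimate: {lx_fx:.0f} MFLOPS")
```

A simple arithmetic weighting of the two rates gives nearly the same number, since the two throughputs are close to each other.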
LX is structured in such a way that much of the EM data for the IL
loops is pre-fetched to local arrays in a DO loop of IL iterations
which performs no floating point arithmetic, and similarly, at the
end the local array of results is written back to EM. These pre-
fetch and post-store portions of LX take 12% of its time (not
counting FX). Similarly FX was coded to precalculate and save
indices and local variables used repeatedly in the code, and this,
together with save and restore of registers used in FX takes 13%
of FX time. The rest of LX and FX appear to be normal code, with
no more than average amounts of pure integer operations and loop-
ing and branching, so that the results should be considered normal
for the flops-per-EM-fetch ratio of these codes.
It is clear from the LX/FX simulations that at their rate of EM
accesses about half the execution time is spent doing the EM
accesses and the integer computations of EM addresses. As an
experiment, a simulation was run with the average delay caused by
contention in CN and EM reduced to 1/2 clock, as would be expected
for the actual average loading of CN (12%). This reduced the
running time of FX by 4.0%. In a second experiment, the execution
times of double precision integer arithmetic were reduced to
the values estimated for single precision in a 32 bit integer
arithmetic unit. This produced a further 6% reduction, for a
total of 10% reduction with both changes, or an FX throughput of
650 MFLOPS.
7.2.6.3 SQRT
A new square root macro was programmed, using recently added
integer/floating transformation instructions (FIX, FLOAT, ADDEX,
MOVEX) to address a local memory table for a first approximation
good enough so that only three iterations are necessary. The
resulting SQRT has 14 flops, runs in 119 clocks, and has a through-
put of 1500 MFLOPS for the array. The entry in Table 7.2 is
incomplete because SQRT was not simulated by itself, the values
shown being extracted from the trace of the FX simulation. This
SQRT accounts for 10% of the flops of LX/FX. The SQRT found in
TURBDA was an earlier version.
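The general technique of seeding Newton-Raphson iteration from a table can be sketched as below. This is not the macro itself, which is not reproduced in this section: the table size, the range reduction, and the use of `math.frexp` and `math.ldexp` in place of the FIX, FLOAT, ADDEX, and MOVEX instructions are all assumptions made for illustration.

```python
import math

TABLE_SIZE = 16  # hypothetical; the size of the local memory table is not given
# Seed table: square roots of the bin midpoints over the reduced range [0.5, 2).
SEEDS = [math.sqrt(0.5 + (i + 0.5) * 1.5 / TABLE_SIZE) for i in range(TABLE_SIZE)]


def table_sqrt(a, iterations=3):
    """Square root by table lookup followed by Newton-Raphson iterations."""
    if a < 0:
        raise ValueError("square root of a negative number")
    if a == 0:
        return 0.0
    m, e = math.frexp(a)  # a = m * 2**e with 0.5 <= m < 1
    if e % 2:             # make the exponent even so that it halves exactly
        m *= 2.0
        e -= 1
    x = SEEDS[int((m - 0.5) / 1.5 * TABLE_SIZE)]  # first approximation
    for _ in range(iterations):
        x = 0.5 * (x + m / x)                     # Newton-Raphson step
    return math.ldexp(x, e // 2)


if __name__ == "__main__":
    for value in (0.5, 2.0, 10.0):
        print(value, table_sqrt(value), math.sqrt(value))
```

With a seed accurate to a few percent, each iteration roughly squares the relative error, which is why a small number of iterations suffices.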
7.3 APPLICATION OF SIMULATOR RESULTS
The above simulation results form the basis for the application
analysis summarized in Chapter 3 and described in more detail in
Appendix A. The extension of the simulator measurements to those
code sequences that were not simulated, is also described in those
locations.
Chapter 8
SCHEDULE AND FACILITIES
8.1 SCHEDULE
8.1.1 Introduction
Realistic scheduling of a large program such as NASF requires the
systematic definition of the tasks to be performed to levels for
which reasonable estimates can be made. With each successive
level of detail the time estimates become more accurate. For the
purposes of this study only the first level has been estimated for
the total effort. It therefore must be considered tentative.
Second and third level schedules have been prepared in specific
task areas to demonstrate the refinements that ultimately must be
prepared for the total effort and to illustrate the management
tools that can be used to monitor, analyze, and control the
program schedule.
8.1.2 The Overall NASF Program Schedule
The NASF program schedule presented in Figure 8.1 is based on a
number of factors and assumptions. It is assumed that the initial
sixteen months is dedicated to the design and final specification
effort. After this initial effort final design leading to
procurement, tooling and manufacturing will begin. Most of these
implementation tasks are of the order of fifteen to twenty-one
months. The final period of integration, delivery, installation
and testing is estimated at eighteen months. This results in a
total program duration of 55 months. The estimates are based on
past experience and best judgement. They do not represent either
the best or worst case possibilities.
No attempt has been made to define a critical path for this
summary schedule. However, critical paths have been determined on
schedules of individual activities as will be demonstrated in the
examples that follow. The final output of the overall program
schedule is shown as the "deliverables".
The NASF schedule has been divided into nine task areas.
1. Program Management
2. Systems Management, Integration and Test
3. Flow Model Processor
4. File Memory Subsystem
5. Support Processor Subsystem
6. System Software
7. User Support Subsystem
8. Facility Engineering
9. System Support
The above breakdown is based on grouping of tasks of similar
nature or relating to a major deliverable element. This same
breakdown could be used for cost estimating as well.
For the most part, the scope of these areas is self-evident.
Program Management includes the monitoring, review, reporting and
control of the overall program activities. In addition, this task
includes schedule, cost and configuration control, generation of
procurement and production releases, subcontract performance
monitoring and liaison with customer representatives. The last
area, System Support, covers the tasks relating to reliability,
maintainability, human factors, spares, documentation and manuals,
and training. Intermediate milestones shown in these two task
areas are not major events but represent the bounds of the time
periods for certain emphasis.
The schedules presented assume that most system integration and
testing is done on the manufacturer's premises. A trade-off,
depending on the availability of the structure for housing NASF,
may show that final integration and testing may be more effective-
ly done at NASA Ames, possibly shortening the schedule.
8.1.3 Schedule Management
The schedules for final design, fabrication, integration and
installation of a large system such as the NASF Processing System
should be developed on a multilevel basis. The first level should
delineate the overall program showing major milestones and "deliv-
erables".
Each activity for these task areas may be delineated in more
detail in a second level of scheduling. These include such activi-
ties as the Integration Plan, or the Fabrication and Integration
of the FMP. A third level of schedule detail further delineates
the major activities within each of these task areas, such as the
design, fabrication and testing of the FMP Processor.
For most planning, schedule control and resource management this
level of detail is sufficient. However, fourth and fifth levels
are usually desirable for specific hardware, software and documen-
tation items to be produced or for individual personnel or group
assignments.
The first three levels are best managed by PERT (Program Evalu-
ation Review Technique) type schedules. In these schedules single
events (start and/or completion dates) are depicted as nodes. The
activities or tasks to be accomplished are depicted as the inter-
connecting lines, and the direction of linear flow represents
sequential and approximate temporal relationships. Where the
completion of one task is a prerequisite before the completion or
beginning of another task that is not a natural sequence, a "dummy"
activity is shown.
The result of this graphic representation of a group of activ-
ities, is a network, showing the starting event and activities,
the major milestones (events) and activities required to accom-
plish a desired goal or goals which are in turn shown as the final
event(s). The PERT network should clearly depict the interrela-
tionships between various tasks.
Once the time elements are assigned to the tasks of a network, the
critical path can be ascertained. The critical path represents
that sequence of activities required for completion of the end
objective that requires the longest period of time; that is to say
that a single day (or month) slip in any one of the activities in
the path, will result in a day (or month) slip in the overall
schedule.
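The determination of the critical path and of the slack in noncritical activities can be sketched as a forward and a backward pass over the network. The sketch below is a simplification: activities are placed on nodes rather than on the connecting lines, the example network and its durations are hypothetical, and it is not a description of how PROMIS performs the calculation.

```python
def critical_path(activities):
    """activities maps name -> (duration, [predecessor names]).

    Returns (project length, {name: slack}); zero slack marks the critical path.
    """
    earliest_finish = {}

    def ef(name):  # forward pass: earliest finish of each activity
        if name not in earliest_finish:
            duration, preds = activities[name]
            start = max((ef(p) for p in preds), default=0)
            earliest_finish[name] = start + duration
        return earliest_finish[name]

    project_length = max(ef(name) for name in activities)

    successors = {name: [] for name in activities}
    for name, (_, preds) in activities.items():
        for p in preds:
            successors[p].append(name)

    latest_finish = {}

    def lf(name):  # backward pass: latest finish that does not delay the end
        if name not in latest_finish:
            latest_finish[name] = min(
                (lf(s) - activities[s][0] for s in successors[name]),
                default=project_length,
            )
        return latest_finish[name]

    slack = {name: lf(name) - earliest_finish[name] for name in activities}
    return project_length, slack


if __name__ == "__main__":
    # Hypothetical four-activity network, durations in months.
    network = {
        "design": (4, []),
        "build_hw": (6, ["design"]),
        "write_sw": (3, ["design"]),
        "integrate": (2, ["build_hw", "write_sw"]),
    }
    length, slack = critical_path(network)
    print(length, slack)
```

In this example a one-month slip in "design", "build_hw", or "integrate" lengthens the whole schedule by one month, while "write_sw" can slip by up to three months without effect.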
One of the many advantages of PERT is that it lends itself to
management, maintenance and analysis by data processing techni-
ques. This is readily accomplished by the use of Burroughs PROMIS
(Project Oriented Management Information System) which has many
useful management outputs. Activities in the critical path are
easily identified. The slack in noncritical activities is
reported. The range of acceptable start and finish dates is
provided. Holidays, overtime and shift work can be made part of
the schedule. Flags for sorting of activities by discipline,
organization, or other keys can be employed. A PROMIS data base
is easily updated permitting rapid assessment of the impact of
changes or other new inputs. The use of this tool in initial
planning is shown in the discussion of the schedules that follows.
8.1.4 NASF SCHEDULES
Figure 8.1 illustrates the major activities of the nine NASF task
areas leading to the achievement of the final program goals (also
shown as deliverables). The interrelationships between some of
the milestones are shown with arrows. For example, the completion
of the final design and specification of the various hardware and
software elements are all inputs to the final integration plans
and system analysis efforts; the design and final specifications
of the hardware items are needed as an input to the activity that
will issue the final facility requirements documentation. For
each activity an expected time for completion is indicated below
the line representing the activity.
Each event is given an identifying number which is used in
creating the data base for analysis and reporting. Table 8.1
delineates this numbering system and shows it to the lower levels
for certain categories.
TABLE 8.1
NASF Event Identification Numbers

Event Numbers        Task Area
000000 - 099999      Program Management
100000 - 199999      Systems Management, Integration and Test
200000 - 299999      Flow Model Processor
300000 - 399999      File Memory Subsystem
400000 - 499999      Support Processor
500000 - 599999      System Software
600000 - 699999      User Support Subsystem
700000 - 799999      Facility Engineering
800000 - 899999      System Support

210001 - 219999      FMP Processor
220001 - 229999      FMP Extended Memory
230001 - 239999      FMP Connection Network
240001 - 249999      FMP Coordinator
250001 - 259999      FMP Cabinets and Cables
260001 - 269999      FMP Power Distributor
270001 - 279999      FMP Test System
280001 - 289999      FMP Data Base Memory
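Because each task area occupies one block of 100,000 event numbers and each FMP element one block of 10,000 within the Flow Model Processor range, an event can be classified from its leading digits. The sketch below illustrates this reading of Table 8.1; the function name is hypothetical and the lookup is not part of the PROMIS data base described in the report.

```python
TASK_AREAS = {
    0: "Program Management",
    1: "Systems Management, Integration and Test",
    2: "Flow Model Processor",
    3: "File Memory Subsystem",
    4: "Support Processor",
    5: "System Software",
    6: "User Support Subsystem",
    7: "Facility Engineering",
    8: "System Support",
}

FMP_ELEMENTS = {
    1: "FMP Processor",
    2: "FMP Extended Memory",
    3: "FMP Connection Network",
    4: "FMP Coordinator",
    5: "FMP Cabinets and Cables",
    6: "FMP Power Distributor",
    7: "FMP Test System",
    8: "FMP Data Base Memory",
}


def classify_event(number):
    """Return (task area, FMP element or None) for a six-digit event number."""
    area = TASK_AREAS.get(number // 100000)
    element = None
    # FMP element ranges start at x10001, so exact multiples of 10000 are excluded.
    if number // 100000 == 2 and number % 10000 != 0:
        element = FMP_ELEMENTS.get((number // 10000) % 10)
    return area, element


if __name__ == "__main__":
    print(classify_event(230417))
    print(classify_event(450000))
```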
To demonstrate the application of management tools for schedule
monitoring, analysis and control, the next two levels of schedule
detail for specific aspects of the NASF have been defined. The
schedule for the fabrication and integration of the FMP, which is a
major hardware item of the NASF, and the final design, fabrication
and testing of the processors, which represent a major portion of
the FMP hardware have been selected for further delineation. It
is quite possible that the critical path for the NASF could be
dependent on activities in these two areas.
Figure 8.2 takes the single activity "Fabricate and Integrate" of
the Flow Model Processor task area and breaks it down into the
next level of detail. The first node of this schedule corresponds
to the second node of the FMP path on the program schedule in
Figure 8.1; the last node corresponds to the third node on the
program schedule. The first node on Figure 8.2 divides (with no
time allocation) into the eight major elements of the FMP.
For scheduling purposes a preferred sequence of integration is
assumed. The first point of integration is that of the FMP power
distribution system with the FMP cabinets and cables. The schedule
then calls for the integration of the coordinator with the use of
the FMP Test System (which will include the FMP diagnostic
controller). Not all of the FMP cabinets, cables and power
distribution system are required for the installation, checkout
and debugging of the coordinator. Completion of some portions of
these can be deferred until required. This level of detail can be
included on the next lower level of scheduling.
The installation of the connection network is next, followed by the
installation, integration and checkout of the processors and the
extended memory modules. The end events of the processor and
extended memory activities are shown as only two tasks for each
element: "Install First Processor" and "Install Last Processor",
and "Install First Extended Memory" and "Install Last Extended
Memory". These end events are used in lieu of having 585*
individual inputs representing each processor and extended memory
module. A rather large series of activities, such as the schedules
for each of the processors, is best handled by a straightforward
status list.
Figure 8.3 further delineates the detailed activities for the
processor final design, fabricate and test activity shown on
Figure 8.2. It will be noted that there are several different
paths leading to the availability of the 585 processors. The
uppermost path shows the activities for the design and procurement
of the printed circuit board. A second and third path are the
activities relating to the design and development of the processor
tester and test software. The lowest path, which merges with the
tester path, involves the design, fabrication and evaluation of a
prototype processor.
*The current estimate for the number of processor and extended
memory module manufacturing starts is 585, which takes into
account shrinkage and spares.
8.1.5 Critical Path
The critical paths for the schedules shown in Figures 8.2 and 8.3
were determined using Burroughs PROMIS. A data base was created
listing each activity's starting and ending event, the mean time
to complete and the activity description. A hypothetical start
date was declared and a PROMIS output was generated providing the
earliest and latest start date, earliest and latest end date and
the amount of slack in each activity. A hypothetical start date
in calendar terms is required, since PROMIS uses a real calendar
for its time base. This is done to permit consideration of
weekends and holidays, and for convenience of reporting. For the
purposes of this demonstration a start date of 1 July 1981 is
hypothesized for the beginning of the final design of the FMP.
Figures 8.4 and 8.5 show the PROMIS outputs for the schedules in
Figures 8.2 and 8.3. Table 8.2 explains the abbreviations used on
the PROMIS reports.
The critical path is that sequence of activities which shows zero
slack. In Figure 8.4, the PROMIS output for the FMP schedule, the
critical path is seen as being:

Preceding   Succeeding
Event       Event                                               Mean
Number      Number      Activity Description                    Time

240001      252000      The coordinator design, fabrication     50 weeks
                        and test
252000      253000      Installation and debugging of the       12 weeks
                        coordinator
253000      254000      Installation and debugging of the       10 weeks
                        connection network
254000      299000      Initial debugging of the FMP            20 weeks

                        TOTAL                                   92 weeks
There is a parallel branch in the critical path in the test system
design and fabrication. Examination indicates 42 weeks slack in
the design, fabrication and testing of the data base memory. This
shows that a starting date anywhere between 1 July 1981 and 21
April 1982 would not impact the finish date of 5 April 1983 for
that schedule element, based on the estimate of 50 weeks for its
completion. The ability to determine slack permits the manager to
effectively allocate resources among the various parallel
activities.
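The slack figures quoted above can be reproduced with a small forward and backward pass over the activity network. The sketch below is illustrative only and is not PROMIS: it works in whole weeks and ignores weekends and holidays, it assumes event numbers increase along every path so that sorting them gives a valid processing order, and the event numbers attached to the data base memory activity are hypothetical (the text gives only its 50-week duration).

```python
from datetime import date, timedelta

# (preceding event, succeeding event, mean time in weeks, description)
ACTIVITIES = [
    (240001, 252000, 50, "Coordinator design, fabrication and test"),
    (252000, 253000, 12, "Installation and debugging of the coordinator"),
    (253000, 254000, 10, "Installation and debugging of the connection network"),
    (254000, 299000, 20, "Initial debugging of the FMP"),
    # Hypothetical wiring: start and end events are assumed for illustration.
    (240001, 299000, 50, "Data base memory design, fabrication and test"),
]


def slack_table(activities):
    """Return (schedule length, {description: total slack}) in weeks."""
    events = sorted({e for pred, succ, _, _ in activities for e in (pred, succ)})

    # Forward pass: earliest time at which each event can occur.
    earliest = {e: 0 for e in events}
    for e in events:
        for pred, succ, weeks, _ in activities:
            if pred == e:
                earliest[succ] = max(earliest[succ], earliest[e] + weeks)

    finish = max(earliest.values())

    # Backward pass: latest time each event may occur without delaying finish.
    latest = {e: finish for e in events}
    for e in reversed(events):
        for pred, succ, weeks, _ in activities:
            if succ == e:
                latest[pred] = min(latest[pred], latest[e] - weeks)

    slack = {
        label: latest[succ] - weeks - earliest[pred]
        for pred, succ, weeks, label in activities
    }
    return finish, slack


finish_weeks, slack_weeks = slack_table(ACTIVITIES)
dbm_slack = slack_weeks["Data base memory design, fabrication and test"]
latest_dbm_start = date(1981, 7, 1) + timedelta(weeks=dbm_slack)
```

Under these assumptions the four coordinator-path activities come out with zero slack, and the latest start for the data base memory work falls on the date quoted in the text.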
TABLE 8.2
PROMIS Report Terms

HEADINGS

PRED NUMBER        Preceding event number
SUCC NUMBER        Succeeding event number
DESCRIPTION        A brief identification of the activity
MEANTIME           An estimate in weeks (unless otherwise noted)
                   of the time expected for completion.
EARLIEST START     The earliest date that an activity can begin.
LATEST START       The latest date that an activity can begin
                   without impacting the schedule.
EARLIEST FINISH    The earliest date that an activity can be
                   finished.
LATEST FINISH      The latest date that an activity can be
                   finished without impacting the schedule.
TOTAL SLACK        The amount of time (in weeks unless otherwise
                   designated) in excess of the meantime during
                   which a task may be completed and not impact
                   the schedule.

ABBREVIATIONS
PR Processor
EM Extended Memory
CR Coordinator
CN Connection Network
DBM Data Base Memory
FMP Flow Model Processor
POW Power
DIST Distribution
SYS System
PCB Printed Circuit Board
HDWR Hardware
MATL Material
DES Design
FAB Fabricate
INST Install
C.O. Checkout
SPEC Specify
MFG Manufacturing
EVAL Evaluate
Figure 8.5, PROMIS Report for Processor Design and Fabrication,
reveals the following critical path:

Preceding   Succeeding
Event       Event                                     Mean
Number      Number      Activity Description          Time

210001      210005      Detail Design                 8 weeks
210005      210010      Final Design                  8 weeks
(210005     210035)     Power (Supply) Design         (8 weeks, parallel branch)
210010      210210      Design Testor                 12 weeks
210210      210230      Design Testor Software        8 weeks
210230      210250      Develop Testor Software       12 weeks
210250      218001      Debug Testor                  8 weeks
218001      218585      Begin Processor Tests         15 weeks
218585      219585      Test Last Processor           0.5 weeks

                        TOTAL                         71.5 weeks

The 71.5 week period begins 1 July 1981 and ends 14 December 1982
with the completion of the off-line testing of the last (585th)
processor. It should be noted that since there are no apparent
constraints on the requirement date of the first processor there
appears to be 18 weeks of slack. This apparent slack disappears
as soon as the requirement for the availability of the first
processor for installation into the FMP (as shown in Figure 8.2)
is considered. Accordingly, in reality, there is a branch of the
critical path after the sixth activity, Debug Testor, which is
Test First Processor, 2 weeks. This results in a critical path of
58 weeks for the availability of the first processor. This same
58 weeks is shown as the time for the first activity of the upper
path of Figure 8.2.
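As a quick arithmetic check on the two path lengths quoted above, the mean times can simply be summed. The activity names and durations below are those listed for the Figure 8.5 critical path; the variable names are illustrative only.

```python
# Mean times in weeks along the processor critical path, in sequence.
# The parallel Power (Supply) Design branch is not on the summed path.
PATH_WEEKS = [
    ("Detail Design", 8),
    ("Final Design", 8),
    ("Design Testor", 12),
    ("Design Testor Software", 8),
    ("Develop Testor Software", 12),
    ("Debug Testor", 8),
    ("Begin Processor Tests", 15),
    ("Test Last Processor", 0.5),
]

# Full path through the test of the last processor.
last_processor_weeks = sum(weeks for _, weeks in PATH_WEEKS)

# Branch after the sixth activity (Debug Testor): Test First Processor, 2 weeks.
through_debug_weeks = sum(weeks for _, weeks in PATH_WEEKS[:6])
first_processor_weeks = through_debug_weeks + 2
```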
8.2 FACILITIES
Refinements made during this study on the concept of the NASF as
presented in the initial study [1] have had no significant impact
on the facility requirements (also presented in the initial study).
Table 8.3 summarizes these facility requirements for power and
floor space and places a maximum limit on each. Appendix J,
General Design Guidelines, delineates environmental factors for the
design of the NASF Processing System hardware. These same limits
should be consistent with the environmental capabilities of the
physical building.
TABLE 8.3
Summary of NASF Power and Floor Space Requirements

             POWER        FLOOR SPACE
ESTIMATE     555 KVA      40,000 square feet
MAXIMUM      750 KVA      50,000 square feet
REFERENCES*
1. Final Report, Numerical Aerodynamic Simulation Facility,
Preliminary Study, October 1977, Contract NAS2-9456, Burroughs
Corporation, Paoli, PA. NASA CR 152060, 152061, 152062.
2. Final Report, Numerical Aerodynamic Simulation Facility,
Preliminary Study Extension, February 1978, Contract NAS2-9456,
Burroughs Corporation, Paoli, PA. NASA CR 152106, 152107.
3. NASF DESIGN GUIDANCE STUDY, Preliminary Draft, M. V. Markoff,
June 1978, Informatics, TN-78-2000-630-I.
4. NASF UTILIZATION, October 1978 (NASA Ames), Draft as amended
through subsequent communications.
5. A Working Paper on Fault Tolerance with respect to Numerical
Aerodynamic Simulation Facility, June 1, 1977, Burroughs
Corporation, Paoli, PA.
6. Lawrie, Duncan H., "Access and Alignment of Data in an Array
Processor", IEEE Trans. Comp., C-24 (1975), pp. 1145-1155.
7. Berlekamp, E. R., Algebraic Coding Theory, McGraw-Hill, N.Y.,
1968.
8. Koppel, Robert, "RAM Reliability in Large Memory Systems -
Significance of Predicting MTBF", Computer Design, February
1979.
9. MIL-HDBK-217B, Reliability Prediction of Electronic Equipment,
September 1974.
10. ANSI X3.9-1978, American National Standard Programming
Language FORTRAN (Fortran 77), 1978.
11. B7700 System Miscellanea, Doc. No. 5001886, Burroughs Corp.
12. B7/6000 Work Flow Language Reference Manual, Doc. No. 5001555,
Burroughs Corporation.
13. Chen, Shy-ching, Speedup of Iterative Programs In
Multiprocessing Systems, University of Illinois, Dept. of Computer
Science, Report No. UIUCDCS-R-75-694, January 1975.
14. B7/6000 Network Definition Language Reference Manual, Doc. No.
5001522, Burroughs Corporation.
15. Thornton, J. E., "Overview of HYPERchannel",** COMPCON SPRING
79 Digest of Papers, IEEE Computer Society, pp. 262-265.
* Appendix references are included at the end of each appendix.
** NOTE: HYPERchannel is a Trademark of Network Systems Corp.
APPENDIX A
PERFORMANCE PROJECTION BASED ON BENCHMARK PROGRAMS
A.1 INTRODUCTION
The four programs used as benchmarks in evaluating the design
were:
(1) NASA 3D implicit aerodynamic flow (aero flow) code
supplied by Ames
(2) NASA 3D explicit aerodynamic flow (aero flow) code
supplied by Ames
(3) GISS weather code, in several different versions
(4) Spectral weather code from MIT
Evaluations of the first three were comprehensive, resulting in
projections of 1.01 Gflops/sec for the implicit, 0.89 Gflops/sec
for the explicit, both at one million grid points, and 0.53
Gflops/sec for the GISS weather code.
A range of throughput values from zero to 1.50 Gflops/second for
individual code sections was derived from the simulation efforts.
These variations are primarily caused by the relationships of
individual subprograms to the data in local processor memory and
extended memory, the choice of mesh size and the choice of the
metric for performance measurement. An example of zero throughput
is provided by the subroutines BCY, BCZ and OUTER in the 3D
explicit aerodynamic flow code. These routines shuffle data in
data arrays in the EM. As no floating point operations are
required for this function, a zero throughput value results. Data
sorting algorithms would be similar examples.
A throughput value of 1.45 Gflops/second is illustrated by the
intrinsic square root function. Square root operates entirely
within the processor, mostly in high speed local registers.
Substantial portions of the simulated codes run at rates from 1.1
to 1.3 Gflops/second. Examples of this are
• subroutine BTRI in the implicit code
• subroutine CHARAC in the explicit code
• subroutine LINKHO in the GISS code
Examples with lower throughput values typically occurred in rou-
tines where a high frequency of access to the three dimensional
global arrays was required. The ability to overlap array index
calculations with floating point operations is limited under these
conditions.
Performance is generally increased when the grid size is
increased. The 3D explicit aerodynamic flow code showed 0.79 Gflops
for 30,000 grid points and 0.89 Gflops for 1,000,000 grid points.
The frequency of execution of individual code segments must be
known for the performance evaluations. Assumptions were made in
those cases where data dependent loop counts and branches occur.
Throughout the programs a mean value rule was generally employed
with an occasional reduction to some more conservative value where
appropriate. In one case, CHARAC from the 3D explicit, simulation
was run at several different assumptions to test the sensitivity
of throughput to the data dependency assumptions. In the case of
CHARAC, throughput varied no more than 15%.
The implicit code achieves the 1.0 Gflops/sec throughput rate
being used as a guide. The explicit code appears to be about 10%
slower than the implicit code.
On GISS weather, the non-vectorizable portions of the code
exceeded one Gflops/sec (COMP3), while the vectorizable portions
(COMP1 and COMP2) were slowed down by EM accessing and
memory-to-memory moves that produced no floating point operations.
The following sections discuss the methods used for projecting
performance. Also to be reviewed are each program, and some other
applications, namely sorting and fast Fourier transforms.
A.2 METHOD
The method used for performance evaluation was generally the same
for all of the first three benchmark programs. Because of time
and budget limitations, only a cursory look was taken at the
Spectral weather code.
First, throughput was analyzed on the basis of FMP computations
only. I/O operations were ignored. Transfers between DBM and
file system are independent of, and go on in parallel with, the
FMP computation. It is assumed that the file manager stages the
next job, and unloads the last job, in times which are completely
overlapped with current computation. DBM-EM transfers are also
ignored, since they go on concurrently with current processing (as
long as EM space is available). At a transfer rate of 40 Mw/s,
the 15 million words of a restart point of a typical aero flow
code are loaded in 0.375 seconds, which can be compared with the
600 seconds duration of a typical run. Hence, even when not
overlapable because of EM allocation conflicts, they should have
little effect on aero flow computations. Therefore, both system
I/O and user I/O were ignored.
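The restart-load estimate above is a one-line calculation. The sketch below restates it with the figures given in the text (15 million words, 40 Mw/s, a 600-second run); the constant and variable names are illustrative.

```python
RESTART_WORDS = 15_000_000        # words in a typical aero flow restart point
TRANSFER_WORDS_PER_S = 40_000_000 # DBM-EM transfer rate, 40 Mw/s
RUN_SECONDS = 600                 # duration of a typical run

# Time to load one restart point if the transfer is not overlapped at all.
load_seconds = RESTART_WORDS / TRANSFER_WORDS_PER_S

# Share of the run that this worst-case, non-overlapped load would occupy.
load_fraction_of_run = load_seconds / RUN_SECONDS
```

Even with no overlap, the load amounts to well under a tenth of one percent of the run, which is the basis for ignoring it.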
Each program was analyzed to find the calling tree of its sub-
routines. Major program parameters such as grid size, total
number of time steps, etc. were then established. This data
allowed the determination of the total number of executions of
each subroutine.
An analysis of all data declarations was then performed to
establish the GLOBAL or LOCAL memory placement of all major
variables. This analysis also determined those variables that were
potential type INALL variables. The programs were then scanned to
establish the placement of the DOALL statement construct
throughout the program structure. This information determined the
number of parallel machine cycles for each DOALL and the processor
utilization level number. A hand count was then performed on all
routines to determine the total number of all floating point
operations (f_i), the number of floating point divide operations
and the number of Extended Memory accesses (m_i). Processor
utilization was also noted for each code sequence. Next, high usage
sections of typical code were selected for hand compiling into FMP
machine language. Results from detailed simulations of these code
sections were then used to develop an empirical formula used to
interpolate the performance of code sections not simulated. This
formula is a linear function of the number of floating point
operations, the number of floating point divide operations and the
number of extended memory accesses. These three factors are
sufficient to fit the simulation results, after constants are
adjusted to provide agreement with detailed simulation results.
The following symbol definitions pertain to the equations below:
$T_s$ = Total system throughput rate - Gflops/second
$T_p$ = Single processor throughput rate - Gflops/second
$\sum f$ = Total floating point operations - Flops
$\sum d$ = Total floating point divide operations over 2% of $\sum f$
$\sum m$ = Total Extended Memory access operations
$\sum t$ = Total program elapsed time (1 processor)
$R_i$ = Ratio of active to total processors
System throughput is then defined as:
$$T_s = \frac{\text{Total Flops}}{\text{Total Time}} = \frac{\sum f_i}{\sum t_i} \tag{A.1}$$
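The linear approximation to this function was then determined as:
$$T_s = \frac{\sum f_i}{\sum t_i} = \frac{1.74}{K_0 + 5.0 \, \dfrac{\sum m}{\sum f}} \tag{A.2}$$
As
$$T_p = \frac{T_s}{512} \tag{A.3}$$
$\sum t$ (elapsed time) was then solved for as
$$\sum t = \sum f \left( K_0 + 5 \, \frac{\sum m}{\sum f} \right) \frac{512}{1.74} \tag{A.4}$$
or
$$\sum t = K_0 \cdot 295 \cdot \sum f + 1471 \cdot \sum m \tag{A.5}$$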
The value of $K_0$ was then estimated as 1.0 or 1.2 based on
individual estimates of the quantity of non-floating point commands
in a given code section. Basic system throughput could then be
calculated knowing the individual counts of floating point
operations and Extended Memory accesses via
$$T_s = \frac{\sum f_i}{\sum t_i} \cdot R_i \tag{A.6}$$
where $R_i$ (ratio of active to total processors) was determined from
the analysis of parallel DOALL statements. Where the formula gave
results in excess of 1.33 Gflops/sec for a particular code
sequence, the value 1.33 Gflops/sec was adopted instead.
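The projection rule described above can be written as a short function. This is a sketch under stated assumptions rather than the study's actual tooling: the function and parameter names are invented for illustration, and the 1.33 Gflops/sec ceiling is assumed to apply before scaling by the active-processor ratio, an ordering the text does not spell out.

```python
FORMULA_NUMERATOR_GFLOPS = 1.74  # constant from the linear approximation (A.2)
CEILING_GFLOPS = 1.33            # value adopted when the formula exceeds it
EM_ACCESS_WEIGHT = 5.0           # weight on EM accesses per flop in (A.2)


def projected_throughput(flops, em_accesses, k0=1.0, active_ratio=1.0):
    """Projected system throughput in Gflops/sec for one code sequence.

    flops, em_accesses: hand counts for the sequence.
    k0: 1.0 or 1.2, per the estimate of non-floating point work.
    active_ratio: ratio of active to total processors (R_i).
    """
    unclamped = FORMULA_NUMERATOR_GFLOPS / (
        k0 + EM_ACCESS_WEIGHT * em_accesses / flops
    )
    # Assumption: clamp first, then derate by processor utilization.
    return min(unclamped, CEILING_GFLOPS) * active_ratio


no_em_access = projected_throughput(1000, 0)           # formula gives 1.74, capped
ten_flops_per_access = projected_throughput(1000, 100) # 1.74 / 1.5
half_busy = projected_throughput(1000, 100, active_ratio=0.5)
```

With no EM accesses the formula exceeds the ceiling for either value of K0, so the ceiling is what governs purely local code.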
The above formula for calculating individual code segment times
assumed that two percent of the floating point operations were
divide operations. The divide instruction consumes 1460
nanoseconds, which is nearly six times longer than the estimated
nominal floating point instruction time. A special count of
divide instructions was therefore included in the analysis. When
this count exceeded the two percent rule, a correction factor of
1460 * (excess count) nanoseconds was added into the above time
calculation formula.
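The segment-time calculation with the divide correction can be sketched as follows. The coefficients 295 and 1471 are those of Equation A.5 and the 1460 ns divide time is from the text; reading the result as nanoseconds follows from expressing throughput in Gflops/sec. The function name and example counts are illustrative only.

```python
NS_PER_FLOP = 295.0          # coefficient on total flops in (A.5), times K0
NS_PER_EM_ACCESS = 1471.0    # coefficient on total EM accesses in (A.5)
NS_PER_EXCESS_DIVIDE = 1460.0
ASSUMED_DIVIDE_SHARE = 0.02  # divides already allowed for in the formula


def segment_time_ns(flops, em_accesses, divides, k0=1.0):
    """Estimated time of one code segment, in nanoseconds."""
    base = k0 * NS_PER_FLOP * flops + NS_PER_EM_ACCESS * em_accesses
    # Only divides beyond the assumed two percent share are charged extra.
    excess_divides = max(0.0, divides - ASSUMED_DIVIDE_SHARE * flops)
    return base + NS_PER_EXCESS_DIVIDE * excess_divides


typical = segment_time_ns(flops=1000, em_accesses=100, divides=20)
divide_heavy = segment_time_ns(flops=1000, em_accesses=100, divides=50)
```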
Examples of exceptions are TRIB and EIGEN in the implicit code (too
many divisions), and AVRX in the GISS weather code (too much integer
arithmetic and data-dependent processor utilization).
Figure A.1 plots the formula used against the results of
simulations for the implicit code, the explicit code, and the GISS
weather code. It is seen that the formula is validated over a
large assortment of "typical" codes. It is also obvious that the
formula must be taken with a grain of salt, and that each and
every section of code should be scrutinized to see if it
represents some exception for which the formula will not work.
[Figure A.1: plot of throughput against $\sum f / \sum m$ (flops per
access), comparing the throughput projection formula
$T_s = 1.74 / (K + 5.0 \sum m / \sum f)$ with simulation results
for code sections including SORT, AMATRX, CHARAC and AVRX. Note on
figure: data points marked x have no EM accesses.]
Figure A.1 Throughput Projection Formula vs. Simulation Results
A.3 THROUGHPUT OF IMPLICIT AERO FLOW CODE
A.3.1 Summary
The throughput of the implicit code is 1.01 Gflops/sec for the
grid size of 100 x 50 x 200. This is the estimate resulting at
the end of the analysis. During the course of the analysis, as
various assumptions and corrections were being applied, the esti-
mate varied from 0.973 Gflops/sec to 1.043 Gflops/sec.
A.3.2 Assumptions
The following were the assumptions and program modifications used
to produce this result. Examples of the resulting code are
included in sections which follow. In addition, Appendix G has a
side-by-side comparison of some of the original codes and the FMP
codes.
All variables indexed on the three grid variables J, K, L were
assumed to be STRUCTURE arrays resident in Extended Memory. In
one case, the accessing pattern was such that the variables could
be assumed to be resident in Processor Memory. In this case, an
instance being executed on a processor was able to access the
STRUCTURE variables without having the time penalty of EM
accessing. Not much improvement is expected when more of the
STRUCTURE arrays are processor resident.
The grid size is JMAX, KMAX, LMAX = 100, 50, 200.
The compiler is able to use a MAD or FADEXL instruction when one
is appropriate, and to reorder arithmetic expressions. For
example, the expression (A + B*C/2) would be implemented with a
FADEXL and a FMAD.
I/O operations are ignored.
NMAX = 100, arbitrarily.
All arrays declared as A(720,6,30), where the 720 dimension is
indexed on KL = (L-1)*ND+K and the 30 dimension on J, are assumed
to be changed to A(100,50,200,6), where the subscripts used will be
J, K, L, and whatever, respectively.
A total of 94 separate sequences of code were identified.
The computation of RESID at the beginning of STEP is assumed to be
a SUMALL over the domain J=1,100; K=1,50; L=1,200. With 1920
cycles in this DOALL, the 9 extra steps at the end for the SUMALL
are insignificant.
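The figure of 9 extra steps is what a pairwise combination of 512 per-processor partial sums would require, since 512 = 2^9. The sketch below illustrates that count; how the FMP actually combines partial sums is not described in this section, so the pairwise tree is an assumption, and the function name is invented for illustration.

```python
def tree_sum(values):
    """Sum values by repeated pairwise combination.

    Returns (total, number of combining steps). Each step halves the
    number of partial sums, as if every pair were added in parallel.
    """
    partial = list(values)
    steps = 0
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0)  # pad odd-length rounds with a neutral element
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
        steps += 1
    return partial[0], steps


# One partial sum per processor; the values here are arbitrary stand-ins.
total, combine_steps = tree_sum(range(512))
```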
All calls on subroutines XXM, YYM, and ZZM were brought up into
line. Further, the resulting code was put down into the DO loop
that normally follows such calls so that XX, YY, and ZZ are recom-
puted each time. The result is that the four result values
produced by each single iteration, within the former XXM, YYM, or
ZZM, are used immediately, and can be LOCAL variables. If the
program were left in its present structure, where all elements of
the arrays XX, YY, and ZZ are computed at one time, the arrays
would have to be either INALL, with 100-fold waste of memory
space, or GLOBAL, with 100-way access conflicts in memory when
these one-dimensional arrays are used in two-dimensional DOALLs.
By computing these elements one at a time at the point where they
are used, the memory to store them is saved. The amount of compu-
tation does not change but several copies of in-line code are
needed to replace each such subroutine.
In VISRHS, and in BTRI, essentially identical code is seen
replicated. One copy is executed at one end point (say I=1), the
other copy is executed at the other end point (say I=IMAX), and the
third copy is inside a DO I=2,IMAX-1 loop. In VISRHS these three
cases were subsumed into a single DO I=1,IMAX loop. In BTRI, an
observation on the incoming data shows that the first iteration is
degenerate (a diagonal matrix is being decomposed, which is
nearly a no-op), so the first copy is rewritten, and the latter
two combined into a DO I=2,IMAX loop.
SMOOTH was rewritten into a single three-dimensional DOALL.
Only those named common areas that are actually used in a program
unit are declared. This improves FMP operation, speeds up sub-
routine entry and sometimes releases memory space.
Where feasible, divisions were replaced by multiplication by the
reciprocal, including every division by a literal.
In doubly nested DO loops with simple subscripting (DO N=1,5 and
DO M=1,5), the code is assumed restructured, either by the
programmer or by a later optimizing version of the compiler, such
that there is no more than one integer multiply per set of
subscripts. For example, one can increment auxiliary index
variables per iteration. Two such loops contain 26 percent of all
the floating point operations in the program.
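The restructuring described above replaces per-iteration subscript multiplies with auxiliary index variables that are incremented. The sketch below shows the idea for an array shaped like A(IUA,5,5) in FORTRAN column-major order, using zero-based offsets; the function names and the choice of array shape are illustrative assumptions, not taken from the compiled code.

```python
def offsets_with_multiplies(i, iua):
    """Offset of A(i, n, m) recomputed from scratch on every iteration."""
    return [i + iua * (n + 5 * m) for n in range(5) for m in range(5)]


def offsets_incremental(i, iua):
    """Same offsets, with strides computed once and then only added."""
    stride_n = iua       # step in the middle subscript
    stride_m = 5 * iua   # step in the last subscript: the single multiply
    offsets = []
    row_start = i
    for _n in range(5):
        offset = row_start
        for _m in range(5):
            offsets.append(offset)
            offset += stride_m   # auxiliary index advanced per inner iteration
        row_start += stride_n    # auxiliary index advanced per outer iteration
    return offsets
```

Both functions visit the same 25 elements in the same order; the second performs one integer multiply in total instead of two per element.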
A.3.3 Analysis of Implicit Aero Flow Code
Equation A.6 (Section A.2) is an extrapolation of the simulation
results to the portions of the code that were not simulated.
About 60 percent of the running time of the implicit code is
represented in the two simulations that were done, namely
subroutine BTRI and the portion of subroutine RHS that used to be
subroutine AMATRX in a previous version of the program. This
gratifyingly high percentage of execution actually simulated arises
because BTRI itself represents over 55 percent of the computation
of the implicit code. One statement in BTRI, which is executed
25,000,000 times during the course of the program, represents 21
percent of all the floating point operations in the entire
program, and is found in the test case.
Exceptions to Equation A.6 are code sequences in TRIB, EIGEN, and
INITIA with an atypically high proportion of divides. These are
executed so infrequently as to disappear from the total throughput
figure. At the beginning of BC there is a section that could have
been implemented as a series of SUMALLs. In this analysis, the
summations were done serially with 38 percent processor
utilization instead. On the other hand, a SUMALL was used at the
beginning of STEP to compute the variable RESID. This runs with
"typical" speed because of the size of the DOALL, which is across
all three dimensions, or 1,000,000 instances, so that the final
512-way summation takes negligible time compared to the 1920 cycles
in the DOALL. The processor utilization for this case is 99.97
percent.
"AMATRX" was simulated. It is the part of the subroutine STEP so
identified in a line of comment. The test case consisted of 3750
floating point operations per processor, achieved by iterating
several times around the code. Hence, the frequency of execution
of loop control was somewhat higher than in the actual case in
STEP, where additional operations are in the same loop. The
observed time is counted in clocks per processor. At 40 ns per
clock, and 512 processors, this computes to 1.330 Gflops/sec for
the entire FMP. Overlap between the several execution units
within the processor was such that on the average there were 1.20
instructions in the course of execution at any one time.
"BTRI" was also simulated. The test case was constructed by
taking the doubly-nested DO loop identified by the comment
"COMPUTE B PRIME", and following it with one pass through "INSERT
LUDEC AGAIN", and wrapping an outer loop around both. There were
a total of 650 floating point operations executed in simulation.
For present purposes, it is instructive to separate the 500
operations executed during the doubly nested loop, and the 150
executed in LUDEC. The LUDEC portion of the simulation executed at
1.30 Gflops/sec, while the doubly nested loop executed at 1.170
Gflops/sec, at 512 processors busy.
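The rates quoted for the simulated test cases follow from a per-processor clock count, the 40 ns clock and the 512 processors. The observed clock counts themselves are not given in this section, so the sketch below only shows the conversion and back-computes the count implied by the "AMATRX" figures; the function names are illustrative.

```python
CLOCK_NS = 40        # nanoseconds per clock
N_PROCESSORS = 512   # processors assumed busy


def fmp_gflops(flops_per_processor, clocks):
    """Whole-FMP rate: total flops divided by elapsed nanoseconds."""
    return flops_per_processor * N_PROCESSORS / (clocks * CLOCK_NS)


def clocks_for_rate(flops_per_processor, gflops):
    """Per-processor clock count that would yield the given FMP rate."""
    return flops_per_processor * N_PROCESSORS / (gflops * CLOCK_NS)


# Back-computed, not observed: clocks implied by 3750 flops at 1.330 Gflops/sec.
implied_amatrx_clocks = clocks_for_rate(3750, 1.330)
clocks_per_flop = implied_amatrx_clocks / 3750
```

On these figures each processor spends a little under ten clocks per floating point operation in the "AMATRX" test case.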
Hence, the assumption of 1.33 Gflops/sec for "ordinary" code execu-
tion speed where all variables are local to the processor is justi-
fied. However, when single statements are found inside doubly
nested DO loops with triple subscripting on most of the arithmetic
primaries, performance is derated to 1.17 Gflops/sec. The two
swatches of code deserving this derating are the loops in BTRI
and a similar loop in VISMAT. The simulation printout associated
with this loop in test case "BTRI" shows that 14.6 percent of the
time the processor was waiting for instruction fetch. These were
primarily integer instructions associated with subscript computa-
tions. A sequence of integer IADDs, for example, can be executed
faster than the instructions can be fetched.
A.3.4 FMP FORTRAN Version
A.3.4.1 One-to-one Mapping from Serial FORTRAN
There is a simple one-for-one translation from FORTRAN furnished
by NASA into FMP FORTRAN as follows. All arrays subscripted with
the grid variables are made STRUCTURE. DO loops (single or
nested) on the grid variables are automatically turned into DOALLs
as long as the data dependence allows it. Temporary variables are
allowed to be LOCAL by default. The implicit code, as supplied by
NASA, is of such regularity that practically all of it can be
transformed into FMP FORTRAN using such simple rules. Because of
this, and in order to save time, most of the FMP FORTRAN version
of the implicit aero code was not even written down, since it was
obvious from the NASA-furnished version by inspection.
SMOOTH and BTRI were rewritten to better match the structure of
the FMP. Discussion follows.
A.3.4.2 SMOOTH
A revised FORTRAN version of subroutine SMOOTH is exhibited in
Figure A.2. All computation is put into a three-dimensional
DOALL. Note that the arrays Q and S (which have total
dimensionality Q(100,50,200,6) and S(100,50,200,5)) are defined as
STRUCTURE variables since they are included in both an INALL
statement and in a USING clause over the domain. These arrays would
exist in Extended Memory. The other variables defined over the
structure (SS, CT, and the temporaries T1, T2, T3, and T4) are
allocated space within each processor. Note that only SS and CT
must be unique to an instance over the sections of the DOALL. The
temporaries could share storage with other instances.
Computations on SS and CT, having 10^6 elements uniformly
distributed over the processors, will take up 1862 (cycles) * 6 =
11178 words of processor memory during the execution of subroutine
SMOOTH.
The other large user of processor memory space is BTRID, a LOCAL
COMMON area which must be declared inside the DOALLs of STEP so it
can be common to the calls on BTRI. Here is an example of the use
of dynamic memory allocation. Upon leaving the last DOALL in
STEP, this LOCAL COMMON is deallocated, leaving space for SS and
CT to be allocated during SMOOTH. See the following section on
the rewritten BTRI for further discussion.
Temporary variables TEMP, T1, T2, T3, and T4 were used to hold
copies of STRUCTURE array elements so that they could be used
through several operations with only one fetch from EM.
The statement NEXTDO, used in this code but not explained in
Chapter 4, is a convenience. The NEXTDO statement is shorthand
for an ENDDO statement followed immediately by another DOALL on
the same domain.
      SUBROUTINE SMOOTH
      COMMON /BASE/ NMAX,JMAX,KMAX,LMAX,DT,GAMMA,GAMI,FSMACH,
     1 DX1,DY1,DZ1,FV(5),FD(5),HD,ALP,GD,OMEGA,HDX,HDY,HDZ,RM,
     2 CNBR,PI,ITR,IMP,INT1,INT2,INT3
      DOMAIN /MODEL/ ; J=1,100; K=1,50; L=1,200
      ...
      INALL /MODEL/ Q(6),S(5),SS,CT(5),T1,T2,T3,T4
C     4TH ORDER SMOOTHING, 2ND ORDER AT THE BOUNDARIES
      DOALL /THREED(J,K,L)/ ; USING Q, S, SMU
      TEMP = 1./Q(J,K,L,6)
      DO 1 N=1,5
      ...
    1 CONTINUE
      IF (J.EQ.2 .OR. J.EQ.JMAX-1) THEN
      ...
      DO 2 N=1,5
      ...
    2 CONTINUE
      ELSE
      ...
      DO 3 N=1,5
      ...
    3 CONTINUE
      ENDIF
      NEXTDO
      IF (K.EQ.2 .OR. K.EQ.KMAX-1) THEN
      T1=Q(J,K+1,L,6)
      ...
      DO 4 N=1,5
      ...
    4 CONTINUE
Figure A.2 FMP FORTRAN Version of SMOOTH
      ELSE
      ...
      DO 5 N=1,5
      ...
    5 CONTINUE
      ENDIF
      NEXTDO
      IF (L.EQ.2 .OR. L.EQ.LMAX-1) THEN
      T1=Q(J,K,L+1,6)
      ...
      DO 6 N=1,5
      ...
    6 CONTINUE
      ELSE
      ...
      DO 7 N=1,5
      ...
    7 CONTINUE
      ENDIF
      ENDDO /THREED/ ; GIVING S
      RETURN
      END
Figure A.2 FMP FORTRAN Version of SMOOTH (Cont'd)
The resulting rewrite of SMOOTH reduces the number of flops from
225 x 10^8 to 201 x 10^8, and the number of EM accesses from 195 x
10^8 to 72 x 10^8, as compared to a mechanical translation of DO
loops to DOALLs. Thus, the time is improved more than the
throughput.
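The statement that time improves more than throughput can be checked against Equation A.5 of Section A.2. The sketch below plugs the flop and EM access counts quoted above into that equation, assuming K0 = 1.0 for both versions (the text does not state which K0 was used for SMOOTH); the names are illustrative.

```python
def elapsed_ns(flops, em_accesses, k0=1.0):
    """Elapsed time per Equation A.5, in nanoseconds."""
    return k0 * 295.0 * flops + 1471.0 * em_accesses


# Counts for the mechanical DO-to-DOALL translation and for the rewrite.
mech_flops, mech_em = 225e8, 195e8
new_flops, new_em = 201e8, 72e8

mech_time = elapsed_ns(mech_flops, mech_em)
new_time = elapsed_ns(new_flops, new_em)

# How many times shorter the rewritten version runs.
time_ratio = mech_time / new_time

# How many times higher its flop rate is. This is smaller than time_ratio
# because the rewrite also performs fewer floating point operations.
throughput_ratio = (new_flops / new_time) / (mech_flops / mech_time)
```

Under that assumption the rewrite runs a little over twice as fast, while its flop rate rises by a somewhat smaller factor.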
A.3.4.3 BTRI
The subroutine BTRI was also modified. Observe that when BTRI is
entered, array B is a diagonal matrix with zeros off the diagonal.
The first piece of code, which is a copy of LUDEC, therefore
executes with most of its input variables equal to zero. LUDEC is
a modified Cholesky decomposition. When faced with a diagonal
matrix, it produces a copy of that matrix for the lower triangular
matrix, and produces the identity matrix for the upper triangular
matrix. In BTRI the variables L11, L22, L33, L44, and L55 are the
reciprocals of the diagonal terms, in order to save repeated
unnecessary divisions later on. The diagonal terms of the upper
triangular matrix are unconditionally equal to 1.0 and hence are
not computed.
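The degenerate behaviour on a diagonal input can be seen with any LU factorization that keeps a unit diagonal in the upper factor. The sketch below uses a textbook Crout-style factorization without pivoting as a stand-in; it is not NASA's LUDEC, whose source is not shown here, and the function name and test values are illustrative.

```python
def crout_lu(a):
    """Factor a square matrix as lower @ upper, with a unit-diagonal upper.

    Crout-style elimination without pivoting, on lists of lists.
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for j in range(n):
        # Column j of the lower factor.
        for i in range(j, n):
            lower[i][j] = a[i][j] - sum(
                lower[i][k] * upper[k][j] for k in range(j)
            )
        # Row j of the upper factor, to the right of the diagonal.
        for i in range(j + 1, n):
            upper[j][i] = (
                a[j][i] - sum(lower[j][k] * upper[k][i] for k in range(j))
            ) / lower[j][j]
    return lower, upper


diagonal_terms = [2.0, 4.0, 5.0, 8.0, 10.0]
b = [[diagonal_terms[r] if r == c else 0.0 for c in range(5)] for r in range(5)]

lower, upper = crout_lu(b)

# Reciprocals of the diagonal terms, stored once to avoid later divisions.
reciprocals = [1.0 / lower[i][i] for i in range(5)]
```

For a diagonal input the lower factor is a copy of the input and the upper factor is the identity, so only the five reciprocals need to be computed.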
The first copy of the former LUDEC, as shown in NASA's BTRI, can
be simplified to the version shown in the attached listing, Figure
A.3. The last iteration of the former LUDEC, at index equal to
IUA, differs from the central iterations only by the omission of
the computation of C PRIME. To simplify the source code, this
copy was pulled into the main iteration in BTRI.
Common area BTRID would be declared in STEP as
LOCAL COMMON /BTRID/ A(LMAX,5,5), B(LMAX,5,5), C(LMAX,5,5),
D(LMAX,5,5), F(LMAX,5)
in that call on BTRI in which the limiting index is LMAX, using
the one-for-one translation of the original. With LMAX=200, this
means that common BTRID is 21,000 words long. When the extent is
JMAX, BTRID will take 10,500 words and when the extent is KMAX,
BTRID will be allocated 5,250 words. Note that in STEP, where
this COMMON is initially specified, it is not declared in a USING
or GIVING statement. For this reason, it is a LOCAL area allocated
within each processor. The copy of the subroutine BTRI resident
in each processor accesses the common area in that processor. By
the time that BTRI is executing, the current instance of STEP
would have initialized the appropriate part of that common block.
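The word counts quoted for BTRID follow directly from the array shapes. A short Python sketch of that arithmetic is given here; the function name `btrid_words` is hypothetical, and the extents 200, 100 and 50 are the LMAX, JMAX and KMAX values used elsewhere in this appendix.

```python
def btrid_words(extent, block=5):
    """Words in COMMON /BTRID/ for a given leading extent."""
    matrices = 4 * extent * block * block  # A, B, C, D: extent x 5 x 5 each
    vector = extent * block                # F: extent x 5
    return matrices + vector


sizes = {name: btrid_words(extent)
         for name, extent in (("LMAX", 200), ("JMAX", 100), ("KMAX", 50))}
print(sizes)  # {'LMAX': 21000, 'JMAX': 10500, 'KMAX': 5250}
```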
Within the separately compiled subroutine BTRI, the declaration of
BTRID takes the form:
COMMON /BTRID/ A(IUA,5,5), B(IUA,5,5), C(IUA,5,5),
1 D(IUA,5,5), F(IUA,5)
SUBRQUTINE BTRI(ZUR)
RSgUH£ STRRTING INDEX = 1
CDItHDN /BTRID/ fl(IURfS_5)_ B(IUR_5_5)_ C(ZUA_St5)_
I D(ZUR_5_5)_ F_UA_5)
DIHENSZ=N H(5_5)
XNPLICIT RERL(L_
INSERT LUDEC _Zl4PLIFIED FDR DIR_DNRL INPUT RRRRY B) FDR i=I
LiE = i. IBCi_i_i)
L33 = i,,'B<1_3_3)
L44 = _,/B(I_4_4)
L55 = £./B(I_5_5)
CDHPUTE LITTLE R_ DHITTED_ THESE TEHPDRRRIES NDT NEEDED
THIS PRS$_ CDHPUTE BI_ RI$
r(i_5) = L55
F(1_4> = L44
F(I_I) = L££
CDHPUTE C PRIHE FDR FIRST RDH
DD 1_ H = 1_5
C HR_ BEEN ELIHINhTED RS R SIHPLE
RESUBSCRZPTIN_ OF THE D RRRRY
S(i_5,N) = L55 X C(Z,5,H)
B(I_3_H) = L_3 X C(I_3_H}
CONTINUE
Figure A. 3 FMP FORTRAN Version of BTRI
HERE NDH _TRRTS THE HRIH LDDP DF BTRZ
DO 13 I = _gIUR
CDHPUTE B PRIME _ BIGR
DO £q N=Z_5
F_Z_N) = F(I_I4) -
- M(I_N,5; x F<_-I,5)
CDHPUT£ B PRIHE
DO 11 H = J._5
DD 11 H = ",_5
H,:N_H) = B(I_NglI; - R{I,Ng._.) K B(I-1,1911} -
B(I-ig_t4) - R{I_N_W_ x E{I-i94_H} -
R{I_N95) _ B._I-Z_5_IIJ
INKERT LUDKC RGRXN
HERE _HRLL BE INSERTED R COPY OF THE FDRHER LUDEE 9
EXRCTLY RS SHDHN ZN THE IHPLICIT CODE CDHPILRTIDN BY SCHREFFER
CDI1PUTE LITTLE R_S
Di = L_I A F(Igl)
of = LZZ * (F(I,_) - L_i × Ol)
D3 = L_ X (F<Ig_) - L31 x Ol - L3g x D_)
D4 = Lqq _ (F(I94) - Lq_ x Di - LH_ A D_ - LH3 x D3)
05 = L55_(F(19D) - LS&_DZ - LSgXD_ - L53xD3 - LSqxDq._
Figure A.3 FMP FORTRAN Version of BTRI (Cont'd)
CnHPUTE BI6 RI$
F(ZtS) = D5
F(Z_) = D_ -U45xD5
F(_3) = D3 - U34xF(%_) - U35xD5
V(Z_) = D_ - U_3XF(I_3) - U_xF(I_4) - U_5XD5
_F (! ,LT. ZUR_ THEN
DQ 15 N = _5
D£ = LZ£_C(I_L_H)
D5 = L55X(C_I_5_H} - LS£XD£ - LS_XD_ - ;.53xD3 - LSqxD_)
B(I_H) = D_ - U_SXD5
B(I_3_H) = O_ - U3_xC(I_H) - U35XD5
B(X_g_H) = D_ - U_3xB(I_3_H) - UZ_XB(It_H) - U_SXD5
C_NTINUE
THIS Z$ THE END DF THE HRZN Z LDDP_ ZNCLUDIN_ Z=ZUR
H_TE THE NE6RTIUE C_DE _NCREHENT$ IN THE NEXT SECTION
D_ _0 Z = IUR-3_ I_ -£
DO 19 N=_5
CQt_TINUE
RETURN
END
Figure A.3 FMP FORTRAN Version of BTRI (Cont'd)
(Note: If the programmer is comfortable only with literal extents
on arrays, all these declarations could be replaced by COMMON
/BTRID/ A(200,5,5), B(200,5,5), C(200,5,5), D(200,5,5), F(200,5)
which, in the present instance, merely allocates some memory that
was going to remain unused in any event.)
For handling a larger mesh, note that only the diagonal elements
of A and C serve any real purpose. All off-diagonal elements are
simply copies of elements of array D with offset subscripts.
Thus, with substantial complication, due to testing to see which
array should be fetched at any given time, BTRID could be
shoehorned into 13,000 words for the 200-long dimension, or into
19,500 words for an LMAX of 300. The present analysis ignores
this possibility.
A.3.5 Analysis
Figure A.4 shows the sections of code into which the implicit
program was dissected for the sake of analysis. Subsequent to
this analysis, it was determined that all calls on subroutines
XXM, YYM, and ZZM should be brought up into line primarily to
avoid unnecessary saving of temporary variables, as described
above under "assumptions".
Table A.1 shows some of the data abstracted from these sections.
In Table A.1 the subroutines XXM, YYM, and ZZM have been combined
into their callers.
Table A.2 shows this data recombined into an estimate of overall
throughput. Rather than clutter this appendix with all
intermediate computations, Table A.2 has the results accumulated by
"group", where a group is a set of swatches all with the same
multiplier, and the same, or approximately the same, processor
utilization percentage. Three subtotals are exhibited. The first
subtotal includes all the easy parts of the code, iterations or
instances which are at least triply nested on the four main
indices J, K, L, and N, the time step. The second subtotal
includes all those swatches of code that are essentially
negligible. Most of the time here represents serial computation where
all the processors are computing CONTROL variables in parallel but
only getting credit for the one operation that it would take in a
serial machine to compute this value. The third subtotal gathers
together some operations where one-dimensional DOALLs, with fewer
instances than there are processors, result in low processor
utilization. Even so, operations with low processor utilization
are essentially negligible at the problem size considered here.
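The way the three subtotals combine into an overall rate can be sketched as follows. The flop and time figures are the subtotal values as read from the scanned Table A.2; their assignment to the three groups is an assumption based on the order of the table, and the dictionary keys are descriptive labels rather than names used in the report.

```python
# (flops, seconds) per subtotal, as read from the scanned Table A.2.
subtotals = {
    "triply nested or deeper": (3792e8, 375.8),
    "negligible, mostly serial": (134e4, 0.00856),
    "low-utilization 1-D DOALLs": (50e6, 0.387),
}

total_flops = sum(flops for flops, _ in subtotals.values())
total_seconds = sum(seconds for _, seconds in subtotals.values())
throughput_gflops = total_flops / total_seconds / 1e9

print(f"total time {total_seconds:.1f} s, throughput {throughput_gflops:.3f} Gflops/sec")
```

The second and third groups contribute well under one second in total, which is why the overall rate stays at about 1.0 Gflops/sec despite their low processor utilization.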
AIR3D---.EIGEN---(LKall)---I--(XXM)----------XXMloop.
L-.EIGENloop
--INITIAT-JKLall.
_GR_D_ , _JKLall.
| =- GRIDIO.
_METOUT.
_JACOB-------F--(JKall)_JKloop.
_-(JLall) JLloop.
t--(KLall)-------KLloop.
--(Nloop).-SPIN.
-STEP
-OUTPUTIO.
FKLall.
--AIR3DIO.
--(PLANE)-KLall--
JKLall.
BC (Lall)--Lloop.
JKall, JK3L.Kall TRIB_TRIBloop.
_-Jall TRIB--TRIBloop.
_KLall.
U--JLall.
--(RHS)------F--(JKall)----_J_P. ( ) --zzmoop.
_(JLall)--_JLloop.
(YYM)--YYMloop.
_--(KLall)---T--KLloop.
L-(XXM)--XXMloop.
L-VISRHS----t--(JKall)--T--JKloop.
I L-(ZZM)--ZZMloop.
h--(MUTUR)--_--JKall--MUTURloop.
L-(ZZM)--ZZMloop.
--(SMOOTH)--JKLall.
--(KLall)---T--(XXM )
BTRI'-
KLloop.
--(JLall)---T--(YYM)J
L--JLloop.
XXMloop.
, LUDECloop--BTRIloop.
- YYMloop.
LUDECloop-- BTRIloop.
--(JKall)---T---(ZZM ) ZZMloop.
_-VISMAT_[ZZM)_ZZMloop.
| L--Vloop_V25loop.
JKloop.
L-STEPSUM BTRI .... LUDECloop--BTRIloop.
--STEPSUMall.
PLANEIO.
Key: CAPS = Program units.
-all = DOALL over indicated variables.
- loop = DO loop.
( ) = null node except for entry and return.
Figure A.4 Breakdown of Implicit Code into Segments of Code
and Nodes for Analysis
A.3.5.1 Description of Table A.1
In the "multiplier" column, N, J, K and L are abbreviations for
NMAX, JMAX, KMAX and LMAX respectively. Below that is the
multiplier ("E" stands for "times 10 to the") which results when these
extents are replaced by 100, 100, 50, and 200 respectively.
"Ident" is the identifier from Figure A.4. Flops and EM accesses
are the result of a hand count of operations.
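As a worked example of the multiplier column, the numeric multipliers can be regenerated from the four extents. This is a minimal Python sketch; `EXTENTS` and `multiplier` are illustrative names, and the group labels are the ones that appear in Tables A.1 and A.2.

```python
from math import prod

# Extents used in the analysis: NMAX, JMAX, KMAX, LMAX.
EXTENTS = {"N": 100, "J": 100, "K": 50, "L": 200}


def multiplier(group):
    """Number of executions for a section nested on the indices named in `group`."""
    return prod(EXTENTS[index] for index in group)


for group in ("NJKL", "NJK", "NJL", "NKL", "JKL", "JK", "JL", "KL", "NJ", "NK", "N"):
    print(f"{group:5s} {multiplier(group):.0E}")
```

For instance, NJKL gives 100 x 100 x 50 x 200 = 1E8, and NJK gives 5E5, matching the entries printed under each multiplier in the table.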
"Special Case" is the column for notes. The only special cases
noted are excess divisions, the occurrence of the SUMALL global
function, and the following two notes.
Note 1: Many of the variables accessed here involve triply or
quadruply subscripted array elements. The progression of
subscripts is extremely regular, say indexed on loop variables
and by literals. It is assumed that the compiler or the
programmer has reduced these subscript computations to not
more than one integer multiply per accessed element. There
are several ways to accomplish this simplification.
Note 2: In reevaluating BTRI for this analysis, a substantially
higher portion of the floating point operations were
identified as FMAD than in the hand compiling that led to the
simulator input. A small adjustment was made on account of
this observation.
The notation "(x3)" or "(x2)" is used to signify that there are
three or two nodes or sections in the branching tree (Figure A.4)
with identical instruction counts, and identical number of times
of execution. There seemed no need to repeat identical entries in
the table.
TABLE A.1
Characterization of Implicit Code Sections
Multiplier
75NJKL
75E8
25NJKL
25E8
NJKL
1E8
Ident.
BTRIloop
VISMATloop
STEPJKloop
STEPKLloop
STEPJLloop
SPINJKL
BCJKL
RHSJKloop
RHSJLloop
RHSKLloop
VISRHSJKloop
MUTURloop
LUDEC
VISMATloop
Flops/
Section
10
131
117
117
6
10
64
64
64
210
533
376(x3)
224
EM access/
Section
29
25
25
4
6
17
22
22
19
99
0(x3)
18
Special
Case
Note 1
Note 1
Note 2
T
1.17
1,17
0.81
0.81
0.81
0.39
0.43
0,74
0.61
0.61
1.17
0.89
1.35
1.23
Proc.
Util.
97.4%
97.4%
98.4%
96.0%
96.0%
97.4%
98.4%
96.0%
96.0%
98.4%
98.4%
97.4%
98.4%
TABLE A.1 continued
Characterization of Implicit Code Sections
Multiplier
5NJKL
5E8
NJKL
1E8
NJK
5E5
NJL
2E6
NKL
1E6
JKL
1E6
JK
Ident.
SMOOTHJKL
STEPSUM
BCJK
MUTURJK
VISMAT
BTRI
BCJL
BTRI
BCKL
BTRI
Flops/
Section
190
10
667
170
5
10
12
10
33
10
EIGENloop 228
INITIAJKL 6
GRIDJKL 3
JACOBloop(s) 38
MAINloopJKL 11
INITIAJK
EM access/
Section
72
80
2
0
0
24
0
16
0
56
6
6
24
5
Special
Case
SUMALL on
10 6 inst's
2 DIV
2 DIV
2 DIV
1 DIV
5E3
JL
2E4
KL
1E4
JACOBJK
JACOBJL
JACOBKL
PLANEKL
2
121
3
i0
No flops
No flops
T
0.60
0.50
[.04
1.28
1.28
1.05
0.15
1.05
0.49
1.05
1.06
0.29
0.154
0.48
0.52
0.088
0.000
0.000
0.197
1.18
Proc.
Util.
99.97%
99.97%
96.0%
98.4%
96.0%
97.4%
96.0%
98.4%
96.0%
TABLE A.1 continued
Characterization of Implicit Code Sections
Multiplier
NJ
1E4
NK
5E3
N
1E2
1
1E0
NJK
5E5
NKL
1E6
Ident.
BCJ
TRIB
BCK
TRIB
SPIN
STEP
BC
VISRHS
STEPSUMser
AIR3D
EIGEN
INITIA
TRIBloop
BCLloop
Flops/
Section
141
3
81
3
99
5
25
1
9
139
lO(x2)
43
EM access/
Section
37
SpecialCase
2 DIV
2 DIV
5 DIV
ll DIV
1D DOALLs
1D DOALL
over L
T
0.26
0.105
0.133
0.053
0.0026
0.0026
0.0026
0.0026
0.0026
0.0003
0.0012
0.0021
0.21
0.126
Proc,
Util.
19.2%
9.6%
0.19%
0.19%
16.0%
38.4%
TABLE A.2
Throughput Computations for Implicit Code
Group
NJKL
NJKL*
NJK
NJL
NKL
JKL
Subtotal
JK
JL
KL
NJ
NK
N
I
Subtotal
NJK*
NKL*
Subtotal
TOTAL
Proc,
Util,
97,4%
99,9%
96,0%
98,4%
96,0%
97.4%
Flops Multi- Total
,per
3593
plier
IE8 3583E8
Time Throughput
96.0%
98,4%
96.0%
19, 2%
9.6%
0,19%
0.19%
16,0%
38,4%
190
852
22
43
314
1
0
123
7
3
210
149
IE8
5E5
2E6
IE6
IE6
5E3
2E4
IE4
IE4
5E3
IE2
1
190E8
4,3E8
,44E8
,43E8
3,14E8
3792E8
.5E4
0E4
123E4
7E4
1.5E4
2.1E4
149E0
134E4
(sec.)
343,2
31,6
,398
.083
,048
,50
375.8
,000078
,000117
,000078
,000269
.000024
,00794
,000057
,00856
10E6
43E6
50E6
,046
,341
,387
1,010
0.157
0.129
1,009
20
43
5E5
IE6
3792E8 376,2
A.4 THROUGHPUT OF EXPLICIT AERO FLOW CODE
A.4.1 Summary
A.4.1.1 Results
A throughput rate of 0.89 gigaflops/second at an average system
processor utilization of 97.7 percent is estimated for the Hung/
MacCormack explicit aero flow code. This estimate is based on an
assumed grid size of 100 x 100 x 100 elements and 100 time steps.
A total of 4.73 x 10^11 floating point arithmetic operations are
executed in 100 time steps, in 532 seconds. An extended memory
data base of approximately nine million words is also required.
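These headline figures are mutually consistent, as the short Python check below shows. It uses only numbers stated in this section, plus the count of nine shared three-dimensional arrays given under the method of analysis; the variable names are illustrative.

```python
total_flops = 4.73e11        # floating point operations in 100 time steps
elapsed_seconds = 532.0      # estimated run time

gflops = total_flops / elapsed_seconds / 1e9

shared_arrays = 9                    # three-dimensional arrays in Extended Memory
grid_points = 100 * 100 * 100        # assumed mesh
em_words = shared_arrays * grid_points

print(f"{gflops:.2f} Gflops/sec, {em_words:,} words of extended memory")
```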
A.4.1.2 Observations
The following general observations were made. Some of these show
up as conclusions in Chapter 3.
o A direct conversion of this algorithm into extended FMP
FORTRAN was accomplished, with considerable ease. All first
and third level subroutines (19 of 30) require basically no
change.
o The ease and efficiency of translation to FMP machine code
was also excellent. A major compiler requirement is
minimization of address indexing operations through
recognition of common subexpressions which are abundant.
o The given algorithm includes a considerable amount of
simple moves from one Extended Memory address to another.
This is visible in routines BCY, PRSETY, BCZ, PRSETZ and
OUTER.
A.4.2 Assumptions
The basic formula used for calculating the total time per module
was transformed to:
Time = K1*#Flops + K2*#EM + K3*#Divs
K1 = 295 nanoseconds per floating point operation (flop)
K2 = 1500 nanoseconds per EM access (#EM). This value
includes time for address calculation.
K3 = 1460 nanoseconds per divide operation (DIV) in excess of
2 percent of the total flop count.
This approach is verified for the explicit code through detailed
simulation of selected typical code segments. Subroutines LX and
FX were selected for this purpose. This data is included in
Figure A.1.
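The timing formula can be written out as a small function. This is a hedged Python sketch: the three constants come from the text, while the function name, the argument names, and the sample counts in the usage line are hypothetical. It also assumes that only divides beyond 2 percent of the flop count are charged at K3, which is how the definition of K3 reads.

```python
K1_NS = 295.0     # nanoseconds per floating point operation
K2_NS = 1500.0    # nanoseconds per Extended Memory access
K3_NS = 1460.0    # nanoseconds per excess divide
DIV_ALLOWANCE = 0.02  # divides up to 2 percent of flops carry no extra charge


def module_time_ns(flops, em_accesses, divides):
    """Estimated time in nanoseconds for one execution of a module."""
    excess_divides = max(0.0, divides - DIV_ALLOWANCE * flops)
    return K1_NS * flops + K2_NS * em_accesses + K3_NS * excess_divides


# Hypothetical counts, for illustration only.
sample_ns = module_time_ns(flops=1000, em_accesses=100, divides=30)
print(sample_ns)  # 459600.0
```

With these sample counts, 20 of the 30 divides fall inside the 2 percent allowance, so only 10 are charged the extra 1460 nanoseconds.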
o The algorithm is a tree structured set of thirty subroutines
on three levels. Figure A.5 depicts this structure.
All level two routines are modified to employ the FMP DOALL
statement in place of the current dual nested DO
statements. The level one main program initializes GLOBAL
variables only. All level three routines are local
subroutines (with copies resident in each processor) to be
executed in parallel as they stand.
o All GLOBAL values and simple constants are stored locally
in all processors.
o The grid size chosen for analysis was 100 x 100 x 100.
A.4.3 Method of Analysis
The initial phase of investigation was a review of available
background material. The Navier-Stokes equations are the
essential mathematical model of the dynamics of a compressible
fluid flow. Reference [A.1] provides the description of an
explicit discrete mathematical algorithm for solving these
equations. NASA supplied this methodology, and the FORTRAN
listing of the resulting program. Figure A.5 shows this program's
structure. This information was then synthesized into Figure A.6,
a list of subroutine groups and the identification of the
program's major outer loop. Further detailed analysis of
individual code segments determined the number of static calls on each
subroutine.
The next analysis was the identification of major data classes.
The following standard FMP classes were identified:
(a) Nine three dimensional shared arrays from the data
base. These arrays are in common and are STRUCTURE
variables in Extended Memory. This data is accessed via
three dimensional subscripts representing mesh points.
(b) System wide scalar variables (GLOBAL variables). These
variables are replicated in all processor local memories.
(c) Local common in processor memory. No communication of
this data between processors is required.
MAIN
Main loop starts
End of main loop
READIO
--MESH
--WALL
--PRTFLOW--WRITEIO
--BCY
-------TURBDA
TIMSTP
-- SBCINT
-- LYC
-- LYI
-- LY
JCLMN
CHARAC
PRSETY
BCY
ADDG
OUTER
BCY
_PRSETY
_--TRIDIA
_--DIAGON
GI
_OUTER
BCY
PRSETY
OUTER
------SBCINT
--LZC _JCLMN
_--CHARAC
_---PRSETZ
_--BCZ
_---ADDG
L----OUTER
--LZ | BCZ
_---PRSETZ
_---OUTER
--LZI i BCZ
_--PRSETZ
_---TRIDIA
r----DIAGON
HI
_OUTER
_LX BCY
OUTER
"----.PRTFLOW--WRITEIO
_PRTFLOW_WRITEIO
Figure A.5 Calling Tree of Explicit Aero Flow Code and Segments
for Analysis
MAIN
Start of main loop
End of main loop
Initialization routines (run once)
_LX
------LY
------LZ
------LYC
------LZC
------LYI
_LZI
--SBCINT (several calls)
------TIMSTP
------TURBDA
------PRTFLO
_Termination output (run once)
Figure A.6 Summary of Calling Tree for Explicit Code
(d) STRUCTURE data wherein an array of elements, one per
mesh point, may be kept in individual processors.
Subroutine SBCINT contains a three dimensional array
"SBC" of this type.
(e) Strictly local temporary data for a single subroutine or
DOALL block.
(f) Nameless temporary working store typically required in
expression evaluations. These are typically assigned to
processor registers by the compiler.
Except for type (d), examples are visible in the FMP FORTRAN
version of subroutines LX and FX (Figures A.7 and A.8).
The next step in the analysis was a survey of all subroutines to
identify the DOALL statements. No such statements are required in
the main procedure or in any level three subroutine, which are all
local to the instances of the DOALLs. All level two subroutines
contain dual nested DO loops, which are directly converted to a
DOALL with 10,000 instances. The LX subroutine provides a typical
example (see Figure A.7). Note in the listing of LX (in Figure
A.7) that the DOALL begins at line 102000 and ends at line
107600. Thus, almost all of LX consists of 10,000 instances of
this code (and the call to FX at line 104500 in each instance).
Thus, twenty cycles are required of each level two and three
subroutine to execute the 10,000 instances, giving a processor
utilization of 97.7 percent.
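The twenty cycles and the 97.7 percent figure follow from dividing the instances among the processors. A minimal Python sketch of that calculation is shown below, assuming the 512 processor FMP configuration referred to elsewhere in this report; the function name is illustrative.

```python
import math


def doall_utilization(instances, processors=512):
    """Return (cycles, utilization) when `instances` are spread over `processors`."""
    cycles = math.ceil(instances / processors)       # passes needed to cover all instances
    utilization = instances / (cycles * processors)  # fraction of processor slots doing work
    return cycles, utilization


cycles, utilization = doall_utilization(10_000)
print(cycles, f"{utilization:.1%}")  # 20 97.7%
```

On the twentieth cycle only 272 of the 512 processors have an instance to execute, which accounts for the shortfall from 100 percent.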
The initialization routines MESH and WALL, being executed only
once, were ignored. The output routine PRTFLN was also ignored.
The next phase consisted of counting floating point arithmetic
operations, floating point divide operations and Extended Memory
accesses in the subroutines.
This count includes the effects of DO loops, fine or coarse grid
partial subscript range values and the program's branching
structure. This information is given in the various columns of
Tables A.3 and A.4. The product of these counts then produced a
total count of operations per subroutine. The application of the
formula Time = K1*#Flops + K2*#EM + K3*#Divs then gave a total
execution time per program module. The total number of flops was
also given by the product of the number of flops per module times
the number of active modules. The system throughput rate T_p was
then computed as:
$$T_p = \frac{\sum \mathrm{Flops}}{\sum \mathrm{Time}}$$
which is an average flops/second rate value.
i00000
_00100
I00800
£oogo0
_01000
_01100
I01_00
_01300
iOlqO0
£01500
I01600
i01700
iOl?lO
IOZO00
I0_£00
_OZ_O0
_OZ300
i0_00
I0_500
iOZ600
IOZ700
I0_900
I03000
_03100
_03_00
_0_300
i03500
I0_600
i03700
_03800
i0_8_0
£0_0
SU_RgUTINK LX
LX _PERRTOR
COIIHON/RI_/ RH_<I0,0si00_I00)_RH_U<IO0_£00_100)_RH_U(1G0_£00_I00)
C_HHON/R_/ PRDICT(10_5)_P(101)
COt_HON/R_/ Y(100_sOYCELL(100)sJSI_JEI_d_JE_JLFH_JL_VF_YH
1 _Z(100_DZCELL<I00_K_£_KKI_KS_KE_kLFH_KL_ZF_ZH
CQttH_H/R_/ I_HK_LE_E_ZL_Ki_k_K3_K_K5
C_HH_N/_5/ _RHMRfSRHMI_GRHHPR_CU_CUI_T_KE_UU_CU_P0_RH_0_RL,X0
C_HH_N/RT/ DX_DXi,DV_DYI_DZ_ D=I_EZWRLL_XRDBNL _DT_CFL_CONST
DO||RIN /EXPLCT/s_=I_i00_J=,ifi00;K=I,¢00
DTDX=DT_DXI
DO 3 I=I_L
PRDICT(I_=RHQ (l_J_}
PRDZCT(Z_Z}=RHOU(I_J_K)
PRDICT(I_3)=RH_U_I_J_K_
PROICT(I_)=RH_N(I_J_K}
PRDICT_I_5_=E _I_J_K_
C_NTINUE
DO _ N=i_
X=I
IRDD=N-i
14HI=N-I
B=I./N
_I=I÷I_DD
U_I=U_ZIfJ_K)
CRLL FX(UII_I_J_K_LZ)
DO 5 l=_lE
_3=Ki
KZ=K3
Figure A.7 FMP FORTRAN Version of LX
103900£04000XO_100£0_00
_0_300£0_;00£04500
zOO600
£04700
£0q800
£0q900
I05000
_05100
105_00
I05300
£05q00
£05500
£05500
£05?00
£05800
£05900
_06000
_06100
106_00
£06_00
106_00
106500
£06600
£06700
£06600
£06900
_07000
z07£00
£07200
107300
£07_00
£07600
£07700
£07800
£07900
5
c
6
C_t<t<
9
4
2Z=Z+IRDD
UZI=U(II_UsK)
UIZ=U(I+IsJsK)
UZ_=U(IsJsK)
ZF(UZZ°6ToUIZ,RHD,(3.XUI1-UZ_)X(_.XUZ_-UIZ),LT,O,) U_I=,SX(UII+UI_
x_
CRLL FX(UZZ_I_J_K_£Z_
PRDZCT(Z_)=(NHZ_PRDICT(I_)÷RHDU(I_J_KI-DTDXR(F(K_)-F(k£_)))_B
PRDICT(I_)=(NH£XPRDICT(I_)+RH_V(I_J_K_-DTDXR(F(_)-F(_I_)))AB
PRDICT(I_)=(NHZXPRDZCT(I_)+RH_N(Z_J_K)-DTDX_(F(N_q)-F(N£_)))_B
PRDZCT(I_5)=(NHI_PRDICT(I_5)+E (I_J_K_-DTDX_(F(K_5)-F(K£_5)))XB
CONTINUE
DEC_DE x
DO 6 I=_sZE
RHQI:_,/PRDICT(I_£)
U (Z_J_K)=PRDZCT(=_)XRHDZ
U _I_U_)=PRDZCT(I_)XRH_Z
U (Z_UtKJ=PRDICT(I_q)_RHDi
EZ(I_J_K;=PRDZCT(I_5)_ RHDI -.SX(U(IJ_K)_÷U(I_J_K)X_÷W(I
R(Z) =GRHHZ_PRDICT(I_Z)AEI(I_fN)
CONTINUE
XDOWN_TRERH B. C° RT I=IL
DD 9 K6=£_5
PRDICT(IL_KS)=PRDICT(IE_6)
CRLL BCY(K_Z_IE_J_J)
CONTINUE
RHUU(I,_,K)=PROZCT(I,_)
RHDV(I_U_K;=PRDICT(Z_)
RHON(I_J_K_:PRDZCT(I,_)
E (I_J_K)=PRDICT(I_)
CONTINUE
CRLL DUTER(U_IsJE_sKSI_E_)
RETURN
END
Figure A.7 FMP FORTRAN Version of LX (Cont'd)
SUBRDUTZNK FX_U_I_Z_J_K_Z=_
X TRRN_PflRT AND _TRE_S ZN X-DIRECTION
CD|IHDN/Rll/ RH_(£OO_£00_I00)_RHDU(100_I00_I00)_RH_U(ZU0_lO0_I00)
COIIHON/RI_/ RHDN(IOO_IOO_iOO)_E(IOO_IOO_IOO)_EI(ZO0_IO0_IO0_
C_tIHDN/RI3/ U(IuO_IOO_ZOO)_V_£OO_iOO_ZOO)_N(IO0_iO0_ZO0)
CDHHflN/RIq/ F(_5)
Cflt|HDN/R_/ PRDICT(101_5)_P(101)
COIIHgN/R_/ Y_IOO)_D';CELL(ZOO)_JSl_J31_J_JE_JLFH_JL_YF_YH
£ _Z(IOO)_DZCELL_ZOO)_KSi_KEI_S£_KE_KLFH_KL_ZF_ZH
COHHDN/R_/ ZSHK_ZLE_ZE_IL_KI_K_K3_K_,K5
CO|tHDN/RS/ _RHH_AHHI_RHHPR_CU_CUZ_STOKES_uO_cO_P_RHOU,_L_X_
CDtIHDN/R6/ RHUL(iO0_IO0_O0)
CDtIHON/R6£/ RHU_RK_RLHBDR
CDHHDN/RT/ DX_DXZ_DY_DYI_DZ_ DZZ_EZNRLL_ZRDBNL _DT_CFL_CDN_T
COHHON/RS/ I_HTHX_Z_HTHY_I_HTHZ_ LY_CNT_ LYCCNT_ LZCCNT_ LZICNT_
I NLYZ_NLZZ_BETR_ET_I_CRKNZ_
CDItHDN/RN6L/ TRNT(IO1)tC_T<IO1),TRNTH_TRNTHB_CD_TH_CDZTZe_ZECTH
CDHH_N/V_SCDU/_GX_IG'{_ZGZ_TRUX¥_TAU_Z_TRUYZ_D_X_DI_Y_DZ_Z_
x UYX_VYX_WYX
RHU=RHUL(IZ_J_K)
RK =GRHHPRRRHU
RLHBDR=STO_ES_RHU
DYI=I,/ (Y ( J_ )-'( _ J-i )
OZI=I./<Z<_+_)-Z<K-Z)>
DyX=,Sx(TRNT(I)+TRNT(I+I))_DY1
Figure A.8 FMP FORTRAN Version of FX
UYX=U(II_J_I_K)-U(II_J-&_K/
UYX=V(II_J_£_K)-V(iI_J-Z_K}
SIGX=P_II_ -(RLHBDR÷_,_RHU)_((U(Z÷I_J_K)-U_I_J_K))xDX£-U'(XxD'(X}
X -RLHBDRR(UYXxDYZ+(N(II_J_KT£)-N(II_J_K-£))_DZ£)
TRUXY=-RHUx(U'{XxDY&+(U(Z+Z_J_K}-U_IfJpK_DXZ-UYXXDYX)
TRUXZ:-RHUX((U(II_J_K_Z)-U(ZI_J_K-Z))_DZ£÷(N(I+ZgJ_K_-N_Z_J_K)_
X DX£-(H(ZZ_J¢£_K)-N_II_J-Z_K3)xDYX_
DI_X=SI_XXUII+TRUXYXV(II_J_K_+TRUXZRN(II_J_K)-RK_(EI(I+£fJ_K)-EI(
XI_J_Kp)XDXZ-_EI(IIfJ+Z_K)-EI(II_J-_K))_DYX_
F(K_):PRDICT(II_£)AUII
F(K_)=PRDICT(II_)_UZI÷SI6X
F(K_3)=PRDICT(ZI_3)XUII+TRUXY
F(K¢_)=PRDICT(II_)AUII+TRUXZ
F(K_5)=PRDICT(II_5)AUII+DISX
IF(ISHTHX.E_,0 .DR, Z,LE.£ .DR, I,GE,IE) RETURN
x SHBBTHING TERN_ x
CBEr¢C_NST_RBS(P(II+I)-_,_P(II)+P(II-i))/(RBS(P(II+i))+
X _,xRBS(P(II))÷RBS(P(II-£)))
CII=_eRT(6RHMRR6RMMZ_RB_(EI(II_J_K)))
COEF_C_EFR(RB_(U(II_})+C_I)
DD 9 K6=1_5
F(K_sK6)=F(K_K6)-CDEFR(PRDICT(I+I_K6)-PRDICT(Z_K6))
RETURN
END
Figure A.8 FMP FORTRAN Version of FX (Cont'd)
[Table A.3 (two rotated pages; the numeric entries are illegible in
the scanned original). Column headings: Routine; Calls/Time Step;
Cycles; # Time Steps; Total # Calls; # Flops/Call; # Divs/Call;
# EM's/Call; Total Time/Call (usec); Total Time (usec);
Total Flops/CPU; # Active CPU; Total Flops.]
Table A.4
Throughput Computations for Explicit Code
PARAMETERS: 100 x 100 x 100 GRID SIZE, 100 TIME STEPS

ROUTINE    TOTAL TIME - us    TOTAL FLOPS    THROUGHPUT (Gflops/sec)
LX         3.30E07            1.98E10        0.60
FX         5.75E07            3.60E10        0.63
LY         4.02E07            3.13E10        0.78
LYC        1.16E07            9.83E09        0.85
LYI        2.74E07            2.46E10        0.95
LZ         3.23E07            2.33E10        0.72
LZC        1.14E07            8.62E09        0.76
LZI        2.61E07            2.02E10        0.77
SQRT       1.47E07            2.08E10        1.41
CHARAC     1.07E08            1.45E11        1.36
DIAGON     3.54E07            5.00E10        1.41
TRIDATA    2.88E07            3.64E10        1.26
PRSETY     2.55E07            5.86E09        0.23
PRSETZ     2.55E07            5.86E09        0.23
GI         1.59E07            1.08E10        0.68
HI         1.47E07            6.60E09        0.45
ADDG       6.66E06            9.40E09        1.41
JCLMN      1.28E06            1.00E06        0
BCY        5.04E05            -              -
BCZ        5.04E05            -              -
OUTER      8.40E05            -              -
SBCINT     0                  0              0
TIMESTP    1.15E07            7.39E09        0.64
TURBDA     3.75E06            1.60E09        0.43
PRTFLW     -                  -              -
TOTAL      5.32E08 us         4.73E11        .89 Gflops/sec

Entries shown as "-" are blank in the source. The placement of the
zero entries in the JCLMN and SBCINT rows is inferred from column
order.
A.4.4 Simulation and Hand Compiling
A validation of the above analysis was conducted by simulated
execution of a typical code section. The FMP simulator is
described in Chapter 7.
The main stream subroutines encompassing the bulk of execution
time were LX, LY, LZ, LYC, LZC, LYI, LZI and their associated
third level subroutines. The subroutine LX and its associated
level three subroutine FX were selected as representative of this
algorithm. LX and FX are shown in Figures A.7 and A.8
respectively. Figures A.9 and A.10 show the FMP FORTRAN versions of
TURBDA and OUTER which were also simulated. No special handling
was needed on these subroutines. Each is a demonstration of a
simple conversion of nested DO loops to a DOALL construct.
The initial effort in preparing the simulator input was the
revision of the original FORTRAN code sections into the extended
FMP FORTRAN language. Modifications were primarily in the areas
of data declarations, domain declarations, and DOALLs. Assignment
to GLOBAL variables was assumed done in parallel across all
processors. In addition to these changes, the code was reviewed
for areas in which an optimizing compiler could be expected to
achieve time savings. These changes typically take the form of a
new local temporary variable holding the evaluation of a common
subexpression in order to improve performance.
In particular, common subscript expressions were detected and
evaluated separately during both the hand analysis and the simul-
ations. These expressions all involve the integer mode sum or
products used to compute an address from the subscript values.
Although a mature, optimizing compiler will find such common
subscript expressions and combine the results transparently to the
user, the hand analysis performed this level of optimization by
hand.
For example, in the subroutine FX, 25 three-dimensional subscript
expressions may be reduced to seven common expressions. Other
changes, such as the use of an iterated DO loop rather than
straight line code, were made to reduce the size of the generated
machine code file.
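The kind of subscript sharing described above can be sketched in a few lines. The Python example below assumes Fortran's column-major storage order for a three-dimensional array with extents n1 x n2 x n3; the function names and the choice of neighboring references are illustrative and are not taken from the hand compilation itself.

```python
def linear_offset(i, j, k, n1, n2):
    """Column-major offset of element (i, j, k), 1-based, in an n1 x n2 x n3 array."""
    return (i - 1) + n1 * ((j - 1) + n2 * (k - 1))


def neighbor_offsets_naive(i, j, k, n1, n2):
    """Recompute the full subscript expression for every reference."""
    return {
        "U(I,J,K)": linear_offset(i, j, k, n1, n2),
        "U(I+1,J,K)": linear_offset(i + 1, j, k, n1, n2),
        "U(I,J+1,K)": linear_offset(i, j + 1, k, n1, n2),
        "U(I,J,K+1)": linear_offset(i, j, k + 1, n1, n2),
    }


def neighbor_offsets_shared(i, j, k, n1, n2):
    """Compute the common subexpression once; neighbors differ by constant strides."""
    base = linear_offset(i, j, k, n1, n2)
    return {
        "U(I,J,K)": base,
        "U(I+1,J,K)": base + 1,
        "U(I,J+1,K)": base + n1,
        "U(I,J,K+1)": base + n1 * n2,
    }


naive = neighbor_offsets_naive(10, 20, 30, 100, 100)
shared = neighbor_offsets_shared(10, 20, 30, 100, 100)
```

Both functions return the same offsets, but the second performs the integer multiplies once and reaches each neighbor with a single addition, which is the saving the hand analysis was accounting for.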
SUBRDUTIN£ TURBDR(CV)
CDHtlgH/RII/RHD(_OU,ZOO_IOO)fRHDU(_OO_ZO0,_O0)_
RHDV(¢O0,¢O0_O0;
COHHQN/R_/ PRDICT(Z0Z_5)_P_0¢)
CDIIHBN/R_/ Y(ZOO&_D'KCELL(_OO)_J_Z_JE¢|J_JE_JLFH_JL_%'F_YH
Z _Z(Z00)_DZCELLk_00)_K_KE_K_KE_KLPH_KL_ZF_ZH
COHIION/R_/ I_HK_ILK_XE_IL_Kz_K_E_Kq_K5
CO#|HON/RS/_RMHR_GR|4H¢_GRHHPR_CU_CVI_$T_KES_U0_C0_P0_RHD0_RL_0
CDMtIDN/R6/ RHUL_Z00_Z00_L00)
DBHRIN /E×PLCT/II=Z_0_J=Z_Z00;_=_00
INRLL/KXPLCT/ TEHP
CVZ : Z.0/CU
OORLL J=V_Z_JE_;K=Kgi_kg_| U_ING /RZZ,'_/R5/
DO Z I=Z_ZL
IF (K,Ee,Z) TEHP=U,SXRBS(EI(I_J_Z)_E_(I_J_))XCV_
ELSE XP(J,Ee,_TEI4P=_,5_RB$(EZ(_Z_K)÷EX_X_K_)_CUI
ELSE TEHP=RBS(EI(I_J_K))XCUI
END_F
RHUL_I_J_K) = _._70E-V_eRT(TKMPAX_)/TEMP+lgS,_)
C_HTIHU£
ENODO_ GIVING /R6/
RETURN
END
Figure A.9 FMP FORTRAN Version of TURBDA
£UBRDUTXNE OUTER(J_,JE,K_,KE)
CDHI4DN/RL£/ RHO_IUO,_OO,ZuO))_HDU_UO,ZOO_zOO))RHDUkIO0,iOO,IO0)
CDIIHDN/RZ_/ RHDN_IOO_UO_ZOO)_(ZOO_IOO_ZOO)_EI_IO0_ZO0_ZO0_
CDI|HDN/RL#/ U_IUO,LOO,IOO),v_ZUO_LOO)£uO)_N_ZUO);O0)IO0)
CDIIHDI&/R)/ "/(LUO))D')CELL*,_UO)+J_I,JEI_J_)JE_,JLFI4)JL,'(F,'_H
fZ_ZOO)_DZCELL_LOO)mh_I_kEZ_I"S_KE_tKLFHtEL_ZF_2H
DDNN_TRERH RT %=IL
RHB(XL,J_) u _Hg(IE,J,K)
RHgUkZL_J_K) = RHOU(IE,J_K/
RHDU_XL_J_K/ = RHDU(XE_J_K)
RHDH_XL_J_K) = RHDN(XE_J_K)
ENDDD _ GIVIN_ /RZL/_RI_/
IF (JE,LT,JE_} GO TO
UPPER B, C, HT J=JL
RHD(IIJK_K/ = RHg(Z_JC_)
RHDU(_JK_K) = RHDU_ZtJE_,K/
RHDU(I_JK_K; = RHDU(I_ JE_)
RHDH(I_JK_K) = RHDH&I_JE_K)
E_I_JKsK} = E_I_JE_h}
ENDDD( GZUIN_ /RIZ/_/RZ_
_F _H,GE,KE_) THEN
EDGE B_C, RT H=KL
DDRLL J=J_@JE_I=_IE ; U_XN_
RHD(I_J_KL) = RHO_I_J_kEE)
RHDU(Z_JsKL} = RHDU(Z_J_KE_)
RHDU(X_JtKL/ = RHDULI_U_KE_)
_HQN_I_J_L_ = RHDN_Z_J_KE_
E(IsJ_KL; = E(I_J_KE_;
ENDDO_ GIUIN_ /RI1/_/RZ¢/
ENDIF
RETURN
END
Figure A.10 FMP FORTRAN Version of OUTER
The next phase was hand compiling. It was assumed that the first
four registers in all groups were designated scratch registers.
They were also used for passing parameters and results to and from
subroutines. The remaining registers were employed for longer
lifetime storage requirements.
Experience during the hand compilation demonstrated that the need
for integer registers exceeded the supply. As a result, storing
and restoring of these registers had to be employed.
In subroutines SBCINT and JCLMN a non-standard approach was
assumed. Both routines have a very minor impact on total
throughput. The SBCINT routine performs a clearing operation on the
three dimensional array SBC and is called four times. SBC was
declared to be an INALL array. A single statement, SBC=0, there-
fore clears it. The routine JCLMN is called from LYC and LZC
subroutines outside of their DOALLs. Although the routine could
be programmed using recurrence, there seems to be no advantage.
This routine is, therefore, assumed to be executed serially.
A.5 GISS CLIMATE PERFORMANCE EVALUATION
A.5.1 Summary
The evaluation described below was done on an intermediate size
(2° latitude steps, 2.5° longitude increments along the equator)
weather program. The program consists of an easily vectorizable
fluid dynamics section (subroutines COMP1 and COMP2 and the
subroutines they in turn call), and a hard-to-vectorize physics and
chemistry section (COMP3 and its subroutines). The average
throughput for the entire program was determined to be 0.532
Gflops/sec. The time for a 14-day simulation with 20 minute time
steps was projected to be 4 minutes, 25 seconds.
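The projected run time can be restated per time step. The following Python sketch does that arithmetic from the figures in this summary; the derived quantities (step count, seconds per step, implied total flops) are computed here and are not stated in the report.

```python
simulated_days = 14
step_minutes = 20
run_seconds = 4 * 60 + 25          # projected 4 minutes, 25 seconds
average_gflops = 0.532             # average throughput for the whole program

time_steps = simulated_days * 24 * 60 // step_minutes
seconds_per_step = run_seconds / time_steps
implied_total_flops = average_gflops * 1e9 * run_seconds

print(time_steps, f"{seconds_per_step:.3f} s/step", f"{implied_total_flops:.2e} flops")
```

Under these figures a 14-day run is 1008 time steps, or roughly a quarter of a second of machine time per step.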
The GISS weather program demonstrated the advantages of the FMP
architecture over that of a vector machine. The vectorizable
portions of the program tended to run slowly because of many EM
accesses, but the unvectorized portion of the program, namely
COMP3 and its subroutines, ran at 1.2 Gflops/sec for the portion
simulated.
A.5.2 Discussion of the Analysis
The following versions of the weather model codes were provided by
NASA as input for selecting an FMP benchmark test.
GISS Models
A. 360/65 version
B. 360/195 version
C. STAR 100 version
D. ILLIAC IV version
The various versions are machine dependent versions of the
Mintz-Arakawa differencing scheme, which numerically solves the
differential equations representing the physical dynamics of weather
conditions. Reference [2] describes this methodology.
The basic database for the GISS model is a series of three dimen-
sional arrays. The data values in individual arrays represent
temperature, pressure, humidity, etc. at each point of the assumed
latitude, longitude and altitude grid. Arrays of one and two
dimensions are also utilized in various code sections in addition
to various simple scalar values.
Minor variations in GISS versions exist due to selecting different
granularities in grid size, time step, split grid, step size, I/O
management and the nature of the Host machine architecture
(Scalar/360, Vector/STAR, Array/ILLIAC) considerations. Grid
sizes vary from a coarse (25, 40, 2) to a superfine (180, 288, 9)
as indicated in Table 2.1 of reference [1]. The historical
increase in computing power has provided the facilities for
including the larger grid sizes and smaller time steps and thereby
improving the accuracy of results.
A medium sized grid of (89, 144, 9) was selected for FMP benchmark
purposes. This size is a valid test of the system's dexterity,
although a larger size would probably enable higher system
efficiencies and simple program conversion to the 512 processor FMP
system. The 360/195 non-split GISS version was used as the basic
FMP benchmark model.
Simulation of code running on the FMP system is necessarily limited
by time and cost. These constraints necessitate separating the
routines of the GISS model into low and high use frequency classes
in order to expedite the analysis. Routines of low frequency
(once/run) and therefore considered of null impact were:
INPUT
GMP
SDET
Routines of high frequency and therefore maximum impact were:
COMP1 - AVRX
COMP2 - AVRS
COMP3 - OZONE
      - SOLAR
      - LINKHO - SQRT
               - EXP
AVRX is an extremely frequently used subroutine and presents an
interesting opportunity for optimizing FMP performance. The
function of AVRX is as follows. First, for every latitude J,
compute a number NM(J) (also called DRAT and FNM). Then, for each
point J,I (I is the longitude index) perform a smoothing function
S,
New PU(J,I) = S(Old PU(J,I-1), Old PU(J,I), Old PU(J,I+1))
over all values of I. Then update Old PU = New PU at all values
of J, I. For any given latitude J, do the smoothing NM(J) times.
NM(J) is a non-decreasing function of distance away from the
equator, although this fact is not used in the original program.
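The serial behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the three-point form of S is taken from the update visible in Figure A.11 (EI + ALPH*(EIM1 + EIP1 - 2*EI)), longitude is treated as periodic, and the names `avrx_smooth`, `pu`, `nm` and `alpha` are hypothetical.

```python
def avrx_smooth(pu, nm, alpha):
    """Serial sketch of AVRX: smooth each latitude row nm[j] times.

    pu[j][i] holds the field at latitude j, longitude i (0-based here).
    alpha[j] is an assumed per-latitude smoothing weight.
    """
    result = []
    for j, row in enumerate(pu):
        old = list(row)
        imax = len(old)
        for _ in range(nm[j]):
            # New value depends only on Old values, so build a fresh row
            # and then replace Old by New, as in the text.
            new = [
                old[i] + alpha[j] * (old[i - 1] + old[(i + 1) % imax] - 2.0 * old[i])
                for i in range(imax)
            ]
            old = new
        result.append(old)
    return result


field = [[1.0, 2.0, 4.0, 8.0], [1.0, 2.0, 4.0, 8.0]]
smoothed = avrx_smooth(field, nm=[0, 1], alpha=[0.125, 0.125])
```

Rows with a larger pass count take proportionally longer, which is the load imbalance that the conversion methods try to manage.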
Several methods of converting this subroutine into FMP FORTRAN are
discussed below.
1. A DOALL on J, with the programming over I and N serial inside.
89 out of 512 processors have instances, and the longest
instances occur at the poles where NM has its maximum value.
2. An outer loop on N, iterating the number of times given by the
maximum value of NM at the poles. Inside, a DOALL over both J
and I allows all processors to execute on the first iteration
of the N loop, but as the successive iterations of the N loop
occur, those instances which test and find that N.GT.NM(J)
exit without performing any work. At the end of the N loop,
only those instances which lie at the poles are doing any
work; the others are idle.
3. Like 2, except that when computing NM(J), the smallest J is
computed (nearest the equator) for which the given value of NM
occurred, giving JL(NM) as a GLOBAL array. The program
structure would look like:
      DOALL J=1, JMAX
         NM(J) = arithmetic expression
         JL1(NM) = smallest J in northern hemisphere for which NM has
                   the value shown in the subscript
         JL2(NM) = largest value of J in southern hemisphere for which
                   NM has the value shown in the subscript
      ENDDO
      DO 1 N = 1, NM(poles)             %N loop
         DOALL J=1, JL2(N); I=1,IMAX    %All points needing smoothing
                                        %in the Southern hemisphere
            PU = S(Old PU values)
         ENDDO
    1 CONTINUE
Method 3 avoids the creation of instances that do no work, and
hence enhances processor utilization. Even though the last few
iterations on N have only 144 instances, since JL1 will equal JMAX
for large N, and JL2 will equal 1 for large N, the average number
of processors busy would be substantially better than that for
either method 1 or method 2. The cost is increased overhead at
the beginning of the DOALLs.
4. Method 2 can be modified as follows. First, the DOALL on I
and J can be replaced by DOALL J=1, 89; II=1,109,36. Inside
the DOALL, a loop, DO M=1,36 is added and the subscript I is
set equal to II+M. The result is that 36 neighboring values
of I are computed within a single processor, and the same old
value of PU can be fetched once from EM for all three uses
within the smoothing function. The result is a decrease in
the number of required EM accesses by almost a factor of
three, while processor utilization is reasonably good (356
processors out of 512, for the particular example).
5. With even more complexity in the management of the mapping
between domain variables and I,J, one can have 36 values
of I per instance, and keep 494 processors busy.
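The figures quoted for method 4 can be checked with a few lines of arithmetic. The Python sketch below is illustrative only; it assumes that a block of 36 neighbouring longitudes needs its 36 own values plus one neighbour on each side, and the function name `method4_counts` is hypothetical.

```python
def method4_counts(jmax=89, block=36):
    """Instance count and EM fetch counts per instance for method 4."""
    ii_starts = list(range(1, 110, block))   # II = 1, 37, 73, 109
    instances = jmax * len(ii_starts)        # one instance per (J, II) pair
    fetches_unblocked = 3 * block            # three fetches per point
    fetches_blocked = block + 2              # each old value fetched once, plus two edge neighbours (assumed)
    return instances, fetches_unblocked, fetches_blocked


instances, unblocked, blocked = method4_counts()
reduction = unblocked / blocked
```

Under these assumptions the instance count is 356 and the reduction in EM fetches is a little under three, in line with the "almost a factor of three" stated in the text.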
Time precluded simulating any more than one of the above options.
Option 5 was selected for simulation, and produced the result
shown in the table. One of the reasons for selecting option 5 is
that the remapping of the values of I,J onto particular processors
might be done, not explicitly by the programmer as shown in
method 4, but by having the compiler map particular instances
onto particular processors. In the prototype compiler, it is
expected that the assignment of instance number to processor will
be fixed at processor number equal to instance number modulo 512.
Future compiler enhancements could include statements that allow
the programmer to specify how instance numbers map onto
processors. The simulation, to some extent, was an investigation
of the value of such mappings.
After AVRX, the bodies of COMP1 and COMP2 are the next most
frequently used. With minor exceptions, they have common coding
characteristics. These are:
- Heavy use of Extended Memory
- Heavy use of three dimensional indexing
- Low number of floating operations/access
The initial section of COMP2 was judged to be typical and was
therefore simulated on the instruction timing simulator.
COMP3 is executed once for every NCOMP3 executions of COMP1 and
COMP2. The radiation routines, LINKHO, etc., are called once for
every NHOGAN calls of COMP3. Values of three and five,
respectively, were used for NCOMP3 and NHOGAN. In COMP3 and its
subroutines, computations are carried on along the vertical
direction, making each latitude-longitude point independent of any
other. Thus COMP3 partitions into a set of independent instances,
each having a specific location on the earth's surface. COMP3 and
its subroutines are characterized by:
- Minimum use of Extended Memory
- Simple parallel partitioning
- High number of floating point operations
- Low number of indexing operations
- Data Dependent branching
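The relative call frequencies implied by NCOMP3 and NHOGAN can be tallied with a short Python sketch. It is an illustration only: the function name `call_counts` is hypothetical, and which particular execution within each group triggers COMP3 or the radiation routines is an assumption.

```python
def call_counts(n_dynamics, ncomp3=3, nhogan=5):
    """Count COMP1/COMP2, COMP3 and radiation calls.

    n_dynamics is the number of COMP1/COMP2 executions. COMP3 is assumed
    to run after every ncomp3-th of them, and the radiation routines on
    every nhogan-th COMP3 call.
    """
    comp3_calls = 0
    radiation_calls = 0
    for execution in range(1, n_dynamics + 1):
        if execution % ncomp3 == 0:
            comp3_calls += 1
            if comp3_calls % nhogan == 0:
                radiation_calls += 1
    return n_dynamics, comp3_calls, radiation_calls


counts = call_counts(15)
```

With the values used in the study, the radiation routines run once per fifteen executions of COMP1 and COMP2, which is why the cost per call matters less than the per-step cost of AVRX and the COMP1/COMP2 bodies.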
The two maximum frequency inner loops of the LINKHO subroutine
were judged typical of this code-section and simulated in detail.
The routines actually simulated during this analysis are
summarized in Table A.5. They are:
* LINKHO (portions)
* COMP2 (portion)
* AVRX
A.5.3 FMP FORTRAN Version
Figures A.11, A.12 and A.13 respectively show the FMP FORTRAN
versions of AVRX and the portions of LINKHO and COMP2 simulated.
Note that AVRX and COMP2 make substantial use of DOALL constructs.
LINKHO does not contain any DOALL constructs since it is
called within each instance of sections of COMP3. LINKHO is an
exceptionally good example of the data- and instance-dependent
computation in COMP3 which would execute efficiently on the system
evaluated even though it would be difficult to vectorize. The
aerodynamic flow codes analyzed did not exhibit the independence
between instances to this degree. Substantial use is made of
parts of the language that see little or no use in the two aero
flow codes, including:
o Domain definitions constructed using domain expressions that
  include previously defined domains. (See AVRX for example)
o INALL declarations (See AVRX for example)
Figure A.14 shows the branching structure of the subroutines.
Note the presence of A**B, which is a form of call on the EXP
function.
Table A.5
GISS WEATHER MODEL
BENCHMARK SIMULATION RESULTS

                                                          Routine
Measure                                           AVRX      COMP2     LINKHO
Total no. of CU simulated instructions              48         32         30
Total no. of EU simulated instructions            3800       3094       2529
Total no. of EU machine clocks consumed          25318      23417      16705
Total no. of floating point register
  related instructions                             338        900       1058
Total no. of floating point arithmetic
  operations                                       134        688       1266
Total no. of machine clocks for F. P.
  arithmetic operations                           1039       7052      11624
Total no. of integer/logical instructions         2800       1449        425
Total no. of control type commands                 662        745       1056
Average execution time for all
  instructions (NS)                              266.5      302.7      264.2
Average execution time for floating
  point operations (NS)                          310.2      410.0      367.3
Average total elapsed time for floating
  point operations                              7557.6     1361.5      527.8
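The three averages at the bottom of Table A.5 appear to follow from the raw counts above them. The Python sketch below reproduces them under one assumption that is not stated in this section: a machine clock period of 40 ns, which is inferred here from the table values themselves. The names `RAW`, `CLOCK_NS` and `derived_averages` are hypothetical.

```python
CLOCK_NS = 40.0  # assumed clock period, inferred from the table values

# (EU instructions, EU clocks, F.P. arithmetic operations, F.P. clocks)
RAW = {
    "AVRX": (3800, 25318, 134, 1039),
    "COMP2": (3094, 23417, 688, 7052),
    "LINKHO": (2529, 16705, 1266, 11624),
}


def derived_averages(eu_instr, eu_clocks, fp_ops, fp_clocks, clock_ns=CLOCK_NS):
    """Return the three average times (in ns) listed at the foot of the table."""
    per_instruction = eu_clocks * clock_ns / eu_instr
    per_fp_operation = fp_clocks * clock_ns / fp_ops
    elapsed_per_fp_operation = eu_clocks * clock_ns / fp_ops
    return per_instruction, per_fp_operation, elapsed_per_fp_operation


averages = {name: derived_averages(*counts) for name, counts in RAW.items()}
```

The last row is the most telling: AVRX spends far more elapsed time per floating point operation than LINKHO, consistent with its heavy integer and indexing load.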
C     NOTE THAT THE CODE DEVELOPED BELOW IS MANUALLY MAPPED
C     TO THE HARDWARE BY STRUCTURING THE CODE.
C     THE DOMAIN DEFINITIONS ARE USED TO ALLOCATE
C     WORK TO PROCESSORS AND TO CYCLES (INSTANCES)
C     WITHIN EACH PROCESSOR.
      SUBROUTINE AVRX
      STRUCTURE COMMON .... HERE ARE ALL ARRAYS IN BLANK COMMON ...
      GLOBAL COMMON .... HERE ARE SIMPLE VARIABLES IN BLANK COMMON ..
     1 ,ALPH(16),DRAT(16)
      GLOBAL COMMON /WORK/ PU(89,144)
      DOMAIN /PROC/: PNO=0,511
      DOMAIN /CYC/: INST=1,26
      DOMAIN /AVRXD/: /PROC/.X./CYC/
      INALL /PROC/ TPU(28),TTPU(28),II(28),JJ(28),EIM1,EI,EIP1
      STRUCTURE LOGICAL DONE(JMAX)=.FALSE.
C     CALCULATE DRAT(J), ALPH(J), AND NLIM(J). ONE VALUE PER LATITUDE
      DOALL J=2,JMAX-1
         TDRAT = DYP(J)/DXP(J)
         ALPH(J) = 0.125*(TDRAT-1)/FLOAT(FIX(TDRAT))
         DRAT(J) = TDRAT
         NLIM(J) = FIX(TDRAT)
      ENDDO
C     LOAD TPU WITH PU
      DOALL /PROC(PNO)/
         DO 1 M=1,26
C     NOTE THE INSTANCE NUMBER WHICH IS COMPUTED HERE
            INNO = 512*(M-1)+PNO
            II(M) = INNO/JMAX+1
            JJ(M) = MOD(INNO,JMAX) + 1
            IF (II(M) .GT. IMAX) EXIT
            TPU(M) = PU(JJ(M),II(M))
    1    CONTINUE
      ENDDO/PROC/
C     START A DO WHILE
  100 CONTINUE
Figure A.11 FMP FORTRAN Version of AVRX
OOALL /PROC(PNO)/
EINl = TPU<I>
EI = TPU(_)
DD /EYE(ZNST)/
Z = II(INST)
J = JJ_IN_T)
ZF (I °_T, ZHRX) EXIT
IF <(DRRT(J).LT.1) .OR. (N°_T.4LIH(J)) THEN
DONE(J) = .TRUE,
_D TO g
ENDZF
EIP1 = TPU(INST+_)
ZF<I.EO°0) EZHi=PU(J_IHRX}
IF (I,E_,ZHRX}EIPI=PU(J_I)
TTPU(INST) = EI + RLPH(I)_{EIHI+EIPI-_.0_EI)
STORE CRSES
IF ((INST.EO.¢) .OR. {IN_T.EO.Z6)) PU(JgZ) = TTPU(INST)
IF(I°EO.1) PUkJ_HRXP1)=TTPU(INST)
_F(I°EO. IHRX_ PU(J_0)=TTPU(INST)
EIH1 = EI
EI: EIP1
ENDDB /C'/C/
SYNCH POINT
NEXTDO
DO IN_T = _6
TPU(IHST) = TTPU_INST)
ENDDa
DO ZN_T=Z_8_7
TPU_INST) = PU_JJ%INST)qIZ_INST))
ENDDO
DD ZNST = 1_
_F((I,EO°0).OR._i. EG.ZHRXPI))TPU(INST)=PU(JJkINST).II(INST))
ENDDO
N=N_I
DD tt:l,Z
TPU(ES+M; = PU_II{HJ_JJ_II_)
ENDDO
ENDDa/PRBE/
DDRLL d=_dttHX
;F (RLL_DOHE_J))) RETURN
ENDDD
_0 TO 1oo
Figure A.11 FMP FORTRAN Version of AVRX (Cont'd)
_UBRDUTINE LZNKHO
CDI1HDtl /RRDCOH/PL_9)_FLE(IO)_FLK(9)$T_TS,TL_9)_TSTR(_)
1 _3HL_9)_CLOUD_Z_RE(IO)_RE_TR(3)_FLXDN_RS(9_A_TR(3)_
_C_C_Z_RSURF,_COSZ_RRP_RRH
COIIHDN /CLDCOH/ _HALE_£6)_HIL(15)_AL_I6)_TRUL_Z6)_OZRLE(I_)_
1 TDPRBS
LBGICRL CLDFLG_RERPLGfLZ_L_
RERL TRUCIR_CTRU55_X_PZOfTN_RERl_RER_RERR_RERC_RERU_RERU_
1 EXI_EX_DENU_DNH0_DNHI_RERU_EXTRU_TRU_RONCN_EDNCN_TDFCN,
EUPCN_EDNCN
INTEGER NCLOUD_I_)_NRERO(I_)
RERL CZREXT_Z_),TRUN(I_3)_PICIR(£_)_PIZ(I_lg)_CB_Z_£_)_
£ BTDP(£q)_TDP(£E)_REP(£_)_EUP(I_)_EDH(£E)_TE3(301)_EUPC(I_),
RDDITIDNRL DECLRRRTIONS NOT USED IN THE _IHULRTED PORTION
ARE OHZTTED FOR BREVITY
STRTEHENTS REBUT PRRRLLELISH RRE OHITTED RLSO SINCE LINKHO
IS C_LLED R_ R SUBROUTINE NITHIN THE INSTRNCE_ OF THE
DORLL /LRYERS/ OF CDHP3+ iN THIS CR_E_ ERCH INSTRNCE
CRLLS LINKH_ INDEPENDENT FROH RLL OTHER IN_TRNCES RND
U_ES R LDCRL COPY OF C_DE NITHIN THE PROCESSOR IN HHICH
THE _NSTRNCE RESIDES, EE_UENC_N_ OF THE EXECUTION HITHIN
THIS _UBRDUTINE IS _OLELY DEPENDENT ON THE INSTRNCE RHD
LOCAL DRTR_ NOT ON ANY OTHER INSTRNCES.
DD ZOO LRH = i_i_
DD £00 K = I_3
OO I01 N = i_NLAYR_
HCC = NCLOUD(N/
,:RER = NRERO(N)
TAUCIR = CZREXT_LRH) X CTRU55 _ NCC
X = TRUN<N_K) + TAUCIR
TAUN(N_J = X
PZO =_TRUCIRAPXCXRO_LAH_ + PIZ(LAH_N_)/_+Z,E-_U)
ZP(N._E._) THEN
TN = TL_N-5)/_73.
ELSE
TN = T_TR(N_/_73.
ENDIF
iF (TN._Eo0,_5_ ,RND, NCC,GT,0JPIO=U,
IF_PIO,GT,Z,E-_) THEN
RER£ = £, - PZO
RER_ = 1° - _PIOACB_LRH_N_
RERR = _RT(AER1/RER_)
Figure A.12 FMP FORTRAN Version of LINKHO
10q600
10_700
lOqSO0
10_900
105000
105100
_05_00
£05300
105#00
105500
£05600 I
105700 ONtll =
IUSS00 EUP(N)
105900 1
106000 EDH_N)
£06100 i
106_00 REF<N)
106300 TDF(N)
£06305 ELSE IF
106310 TOr(H)
106315 REF(N)
1063_0 EUP(N)
i063_5 EDN(N)
105_00 ELSE ZF
106500 TDF(H_=I,U
106600 REF(N_ = 0.0
106700 _UP(N; = U,0
i06_00 EDNkN} = 0,0
107_00 ELSE
107#50 ;F _x
107500 E×TRU
107600 ZT'( =
_07700 TDF(N
i07800 ELSE
i07900 EXTRU = 0,U
£0_000 TDF(N_ = 0.0
£03050 ENDZF
i0_00 REF_N)
10_g00 $1 = l,
_08300 ::_ = ((
10_0P i 0
£0350b EDH(N;
108600 EUP(NJ
IUG?00 EHDZF
10_800
_0_900
RERU = (1. - RERR)/_,
RERU = (1, R RERR)/_,
RERC = SQRT(3,XRERIXRER_)
%1 : -(RERC_X)
EX£ = 0.U
;F (XI .GE. -£_0,_18) EXI = EXP(XI)
ZF _E_I,LT,I,uE-30) EXi=0.0
EX_ = EXI_EXI
DEN0 = I./((RERUxRERU_ - (RERU_RERU_EXE))
DNtl0 = ((BT_P(N) - BTDP(N+I)/_XXRERE))_
(_RERV - RERU_EX_) - (RERR_EXI))
RERU + RERUXEX_
= (BTOP<N)XDNH£ - DNH0 - BTOP(N+I)_EX1)X
DEN0_RERR
= _BTDP(N+i)_DNH1 + DNN0 - BTgP(N}_EXI)_
OENUXRERR
= HERU_RERUR(I.-EXZ>_DEN0
= (RERV-RERU_xDEN0_EX1
(NCC.GT_U) THEN
= 0.g
= 0.0
= BTaP(N)
= _TDP(N_I)
X,_T_I_E-_ THEN
.LE, _5,0) THEN
= EXP(-X_
) = TE_(ITY) • _TY-IT'$_I) A _TE3(ZTY_I)-TE3(ITY))
= U,U
U - TDF_N;
1,0 - EXTRU)/X-TDF(N_) ^ <(eTOP(N) - ETOP(N_lJ)R
,6666)
= BTOP(N+I)_xI+_
u BTQP(H)_XI-x_
DEN0 = I_0/_I.U - RDNCN_REF_N_)
EDt_CN = _EDtICt=-,EUP,_H_xRDHCN_ _ TDF(N_ :,_ OENO '1" EDIJ(N_
Figure A.12 FMP FORTRAN Version of LINKHO (Cont'd)
1di
11G
£00
_F (NCC.GT,0) CLDFL6 = ,TRUE,
IF(CLDFL6,RND,PID,6E.1,UE-_) RKRFLG=,TRUE,
IF {,HDT,(CLDFL6,0R,RERFL6)) THEN
TRU = TRU _ X
IF (TRU ,_T, 15) THEN
TDFCN = O.
ELSE IF ((_O,XTRU+i,).LT,_) THEN
;TY=£
TOFCN _ TE3(ITY_+(TY-ITY+I)_(TE_(ITY+I)-TE3(ITY))
ENDZF
ENDIF
IF _RERFLG) THEN
RDNEN = REF(N_ _ TDF(NAXTDF(NA_RDNCN_DENO
TDFCN = TDFCNATDF(N_DENO
ENDZF
IF(NCC.HE.0 .OR. FIO.LT.Z.0E-4) THEN
TDFCN = U,U
RDNCN = 0,0
TAU = 0,0
ENDZF
EUPC_N) = EUPEN
EDNE(N_ = EDNEN
TDFE(N) = TDFEN
RDNC(N_ = RDNCN
COIITINUE
DO Zl_ H = N6-£_£_-£
DEND = 1,O/_I,O-RUPCN_REF(H))
KUPCN = EUP<H) _ (<EDN_H)XRUPCN+EUPCN) X TDF(H)xDENU
_F (H,NE,£) THEN
RUPCN = REF(H* • _TDF(H)_TDF(H)ARUPCN_DENO)
L=H-¢
DENU = I,/_£o-RDNC(L/xRUPCN_
PEFUP = kKUPCN _ EDNE(L_RUPEN)xDENU
PEFDN = (EUPCN + EDNC(L)ARDNC(L))_OENO
ELSE
PEFUP = EUPCN
PKFON = 0,0
ENDZF
FE(H) = FE(H) + _PEFUP-PEFDN)_CLKRH
CONT_NUE
CONTINUE
COtlTZNUE
Figure A.12 FMP FORTRAN Version of LINKHO (Cont'd)
C
C
C
C
C
C
C
C
C
C
:..0o
_oo
C
THIS ZS THE _'CTZDN OF G_$_ , COIIP :_ THRT HR$ _I#.IULRTED
CDI&TZNUE
ENDDO
DDRLL j_JI4_;=X_Zl4
;14_ = iH
_L_E
iH_ _ 2
_NDX_
DD ZOO L = _NL_'_
CDRXDL;_ FORCE
DORLL J=_JIII4Z_I=Z_Z|4
;F <;,E_,Z} THEN
;14_=ZH
ELSE
;HZ = i
END;F
DO _O0 L=_HL_(
HERE THE COtltIDN _U_CRZPT E_PRE_;ON_ ARE t_OT GXUEN
_U7 THE CBttPZLER ;_ R_UHEO TQ HRUE KXTRRCTED THEH RPPROPRT_TELY,
RLPH = FXCLI x _.P'(Jf'Z)'rP,.J-b.,Z}}'K_.FD_,.J_'Z')_F_J';.J-b._
UT..J_,,.14.L,_L,. = U';_J¢_.HZ,_LI "r I'_LFH_U(J_/II.L,_L..
UT_J_,'tI.L_Lj :-: UT,.._t_tH.I._L,t - HLPH_U_.J_ /|I; ..L,,
CDII7 _ I4U_
rtlDOD
Figure A.13 FMP FORTRAN Version of Part of COMP2
c
c
3,00
C
c
c
uERTICAL RDUECTIOI| OF THERIIDBYIIRHIC ENER@Y
COItP ._ crtIITIHUE_ BEYONO HERE, THI_ i_ THE END OF THE PIECE _IHULRTED
Figure A.13 FMP FORTRAN Version of part of COMP2 (Cont'd)
A.5.4 Results
Table A.6 shows some of the assumptions and summarizes the results
of the analysis. Table A.7 shows a more detailed breakdown of the
analysis by subroutine.
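The totals quoted in Table A.6 can be cross-checked with simple arithmetic. The Python sketch below is illustrative; it uses only the figures given in that table (a 14 day run, 20 minute time step, 1.41 x 10^11 flops for the system and a maximum time of 4.42 minutes), and the function names are hypothetical.

```python
def total_time_steps(days, step_minutes):
    """Number of time steps in a run of the given length."""
    return days * 24 * 60 // step_minutes


def system_gflops(total_flops, minutes):
    """Sustained rate in Gflops for total_flops executed in the given time."""
    return total_flops / (minutes * 60.0) / 1.0e9


steps = total_time_steps(days=14, step_minutes=20)
rate = system_gflops(total_flops=1.41e11, minutes=4.42)
```

Both results agree with the table entries (1008 time steps and about 0.53 Gflops), to the precision at which the inputs are quoted.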
This is a worst-case analysis, in that the data dependent branches
were assumed to demand the most computations. This was done in
order to estimate the worst-case maximum running time of the GISS
weather code. Those weather conditions that result in faster
running, such as clouds that reduce the amount of radiation
computation needed, will result in a faster total run for the
whole program. They also result in fewer computations actually
performed.
An interesting detail of the analysis concerns the assignment of
instances to processors. In the prototype compiler, instance
number is computed as described in the section on FMP FORTRAN, and
instance number i is processed in processor number (i mod 512).
That is, at the beginning of the DOALL, processors 0 through 511
are given instance numbers 0 through 511 to do, and then each
processor increments instance number by 512 to find its next
instance to do, until all instances in the DOALL are exhausted.
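The fixed assignment just described can be written down directly. The following Python sketch is only an illustration of the modulo rule; the function names `processor_of` and `instances_for_processor` are hypothetical and are not part of FMP FORTRAN.

```python
N_PROCESSORS = 512


def processor_of(instance, n_proc=N_PROCESSORS):
    """Processor that executes a given instance number."""
    return instance % n_proc


def instances_for_processor(p, n_instances, n_proc=N_PROCESSORS):
    """Instances done by processor p: p, p + 512, p + 1024, ..."""
    return list(range(p, n_instances, n_proc))


# Example: the 89 x 144 latitude-longitude points of the benchmark grid.
n_instances = 89 * 144
work = [instances_for_processor(p, n_instances) for p in range(N_PROCESSORS)]
```

For this grid every processor receives either 25 or 26 instances, so the instance counts themselves are nearly balanced; whether the running times are balanced depends on which instances land together, as discussed next.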
In COMP3, a major contribution to whether a given instance will
run for a long time or a short time is the condition of night vs.
day. Radiation computations are much simpler on the dark side of
the earth. At the equinox, computations would be for daytime
along 72 meridians, and for nighttime along 72 meridians. As the
DOALLs are arranged, with latitude subscript J first, all
processors do daylight instances together, and all processors do
nighttime instances together. This argument is somewhat
oversimplified, because of dawn and twilight effects, and must be
modified for other seasons where all points around one pole are in
daylight, and all points around the other are in darkness. However,
more detailed analysis still confirms that, for the GISS weather,
the straightforward assignment of instance number to processor number
results in nearly equal distribution of not only daytime and
nighttime within each individual processor but also latitude,
thereby helping to distribute the computational effort evenly
among all processors, and tending to make them finish nearly all
together at the end of COMP3.
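The day/night balance argument can be illustrated numerically. The Python sketch below rests on simplifying assumptions that are not in the report: instance numbers start at 0, the latitude index J varies fastest, and the first 72 meridians are taken as daylight and the remaining 72 as night. The function name is hypothetical.

```python
JMAX, IMAX, N_PROC = 89, 144, 512


def day_night_per_processor(jmax=JMAX, imax=IMAX, n_proc=N_PROC):
    """Return a (day, night) instance count for every processor."""
    counts = []
    for p in range(n_proc):
        day = night = 0
        for instance in range(p, jmax * imax, n_proc):
            meridian = instance // jmax      # J varies fastest (assumed)
            if meridian < imax // 2:         # first 72 meridians in daylight (assumed)
                day += 1
            else:
                night += 1
        counts.append((day, night))
    return counts


counts = day_night_per_processor()
day_counts = {day for day, _ in counts}
night_counts = {night for _, night in counts}
```

Under these assumptions each processor handles 12 or 13 daylight instances and 12 or 13 night instances, and all processors work through the daylight half of the globe before moving on to the night half.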
TABLE A.6

Input Parameters
   Grid Size        = 89 x 144 x 9
   Time Step        = 20 minutes
   Total Time       = 14 days
   Total Time Steps = 1008
   NCYCLE           = 6
   NHOGAN           = 5    Radiation call frequency
   NCOMP3           = 3    Physics call frequency

Output Result Totals
   Flops/EU         = 2.88 x 10^8
   Max Time/EU      = 4.42 minutes*
   Flops/System     = 1.41 x 10^11
   Gflops/System    = 0.532

* Does not include system startup time

TABLE A.7
[Detailed breakdown of the analysis by subroutine; the table body is not recoverable from the scanned page.]

A.6 SPECTRAL WEATHER
A.6.1 Summary
The spectral weather is expected to run with substantially higher
throughput than the GISS weather does. Its fluid dynamics
portions are done by spectral analysis, with each processor
processing an FFT independently of all other processors. (For a
discussion of the case that only one FFT is to be executed in the
FMP, see Section A.7.) Thus, the fluid dynamics computations are
much more locally contained, since all the intermediate results in
the FFT can be contained within processor memory. The chemistry
and physics portions of the spectral weather code are
substantially identical to those of the GISS weather code, and the
analysis of one can serve as the analysis of the other.
Therefore, the fluid dynamics portions of the spectral weather
code are expected to run somewhat better than the fluid dynamics
portions of the GISS; the chemistry and physics portions would
have the same throughput exactly. (This ignores the effects on
throughput of individual programming style, assuming that the
spectral weather style is no better nor worse than the style seen
in GISS.)
The FMP FORTRAN of Figure A.15 shows the essential portions of
subroutines GDSPCI and FFTFOR. It is clearly efficient. The
inner loop in particular has only singly indexed local variables
and a substantial proportion of multiply-add operations. A
characteristic of all these loops is a short string of integer
operations after the index test and before any floating point
operations start. The slowdown due to integer
indexing that appeared in BTRI does not appear here except in the
DO 317 loop in GDSPCI (see Fig. A.15), which is done N times, whereas
the inner loop is done N*LOG N times.
All local variables, including the local arrays, are substantially
less than the 4096 words of address space that is accessible rela-
tive to the stack pointer, and which is reserved to subroutine
local use. Such examples support the decision to have relatively
short address fields within the instruction.
An estimate of 0.6 Gflops/sec was made for the spectral weather
code. This estimate is not yet based on the detailed analysis
performed on the other codes. The estimate is based on prior
knowledge of the chemistry and physics portions of the GISS
weather and an initial evaluation of the efficiency of execution
of the FFT portion as described above.
A.6.2 FMP FORTRAN Version of FFT Portion of Program
In the MIT spectral weather code, the FFT appears as subroutine
GDSPCI which calls on subroutine FFTFOR. GDSPCI takes NN arrays
of data, splits each array's odd and even parts symmetrically
about the center index, and rearranges the odd and even parts into
real and imaginary parts of an array of complex values. FFTFOR
then performs NN fast Fourier transforms on the NN complex arrays,
producing NN transforms.
NN is equal to the number of layers times the number of meridians.
In a model with 144 meridians, 89 latitudes, and 12 layers, NN =
12 x 144 = 1728 transforms would be performed at once.
316
317
c
c
c
FHP F_RTRRN VER_IDN DF F_URIER TRRNSF_RH PffRTI_N
OF HIT SPECTRRL _ERTHER
_UBRDUTINE _DSPCI_DSPEC_DRTRRL_DRTRZH_NLEV)
CQtIHUH /FFT/ _P<7_7_IL_N<g_7)_NTRRNS(16)_LRZ_NLRTNF_
1 N_PRR(_)_L_GN
COIIHUN /FTCET/ N_NLRT
STRUCTURE DRTRRL(_)_DRTRIH(1)_D_PEC(I>
DQHRIN /_PRCE/|IN=Z_N;Z=I_NLRT|L=ZfHLEU
_IHPLIFY C_DE BY RSSUNZNG N EVEN F_R THE TZHE _EIN8
RE_IUN /LRTLU£_<Z=Z_NLRT);(L=I_NLEV/Z>)/=/$PRCE(_sI_L)/
Z_DD = H_D(NLEV_)
NL_ = NLEU/_
C_HBINE THE DRTR FRDH LEUEL• RND LEVEL NL_+_
RNO FRDH LEUE_ _ HZTH LEVEL NL_ ETC, _NTU
THE RERL RNO _HRSINRRY PRRTS UF THE DRTR INDEXED
DN LEVEL, THE FF_ I_ THEN D_NE _N THE COMPLEX
DRTR HHICH I$ THEN UNRRUELED IH SU_RDUT_NE
SPCED1
D_RLL/LRTLV_(I_L;/ _ US_N_ DRTRRL
DD 316 IN = I_N
DRTRIM_IN_I_L) = DRTRRL(ZN_I_L-NL_)
ENDD_/LRTLV_/ ; _IUING DRTRIH
CRLL FFTF_R_DRTRRL_ORTRIH;
DDRLL ,'LRTLU_(I_L)/ _ U_IN_ DRTRIH_DRTRRL
S_ 317 IN = I_N
DRTRRL<IN_I_L_NL_) = ORTRIH<[N_L;
DRTRIH(_ti_I_L_NL_) = U.DO
DRTRRL(INfI_L_ = DRTRRL<_N_i_L) + DRTRIH(IN_I_L_NL_)
DRTR_H<IH_I_L; : ORTRIM(_H_I_L) - DRTRRL(IN_Z_LcNL_)
C_NTINUE
ENDDO/LRTLU_/ _ GIVZNG DRTRRL_DRTRIM
CHLL FORSPC(DSR£C_DRTRRL_DRTRIH_NLEV)
RETURN
END
RLTERNRT£ ENTRY SPC_DI z_ VERY _ZHILRR_ UHITTE_ F_R N_N
Figure A.15 FMP FORTRAN Version of GDSPCI and FFTFOR
_UBRnUTINE FFTFOR(ORTRRL_DRTRXM)
STRUCTURE DRTRRL(1)_ DRTRIH(£)
O_UBLE PRECISION NP_N_N_
CO/IHON /FFT/ HP(7_7_£5)_N(_7)_NTRRNS(16)_LRZ_NLRTHF_
tIGPRR(?_L_GN
CQHHDN /FTCST/ NpHLRT
FOR BREVITY, THIS EXAMPLE X_ SIHPLIFZEO TO THE FQRNRRQ
FFT ONLYf L_RVZN_ THE REVERBE TRRNBFORH TD BE RDDED LRTER
D_HRIN /SPRCE/¢IN=£_N;X=i_NLRT;L=i)NLEV
DDHRZN /LRTLU_/; I:£_I4LRT; L=i_NL_
INRLL /LRTLEU/ DTRL(15)_DT_H(£_)
O_RLL /LRTL_U<I_L)/_ U_NG DRTRRL_DRTRIH_/FFT/_/FTCST/
no 32 d=i_N
OTRL(4) = DRTRRL(NTRRNS(4)_L)
DTIH<J) = DRTRZH_NTRAN_(J)_I_L)
DTIH(NTRRNS(J)) = DRTRIH(d_I_L)
OTRL<NTRRN_<J)) = DRTRRL_JsIS_#
TEHPR = DTRL_xJ-¢) ¢ DTRLk_xJ)
TEHP! = DTIH(_XJ-=) _ DTIH(_XJ)
DTRL(_xJ) = DTRL(_xJ-i) - DTRL(_x4)
DTIH(_XJ) = DTIH(_kJ-£) - DTIH(_XJ_
OTIH(_J-¢) : TEtlPI
DTRL_xJ-_) ; TEHPR
DO 90 ZI = _,LOGN
t&UH = _xxlI
NUHHF = NUH/_
N_S = IA_XX(LBGN-II)
THE ABOVE EeURL_ N/NUH_R$ _HDHN ZN THE ORI6INRL PRO6RRH_
BUT PDNER_ OF _ RRE NUHERIC _HZFT_ tlUCH FR_TER THAN R
DZVZDE
tlUHJK = NUtI_(J-¢)
_L = £+NUHJK
HH = LL_HUHHF
t|OTZCE THE DELETtOH FROH THE_E VARIABLES or OFF_ET_
CORREEF_NDZNG TO THE DDHRIN URRIREL£_ HHZCH APPERRED
_N THE ORIGINAL
Figure A.15 FMP FORTRAN Version of GDSPCI and FFTFOR (Cont'd)
TEHPR = DTRL(LL) _ DTRL(HH)
7EHPI = DTIH(LL) _ DTIH(HH)
DTRL<HH) = DTRL<LL> - DTRL(HH>
DTZH(HH) = DTIH_LL# - DT_H(HH)
DTRL(LL) = TEHPR
DTIM(LL) = TZHPZ
DO 90 K=_NUHHF
LL = K_NUHJK
HH = LL T NUHHF
HHH = t&SS_(K-_)
H_ = -N_sHHH)
14OTE THRT THE RBDUE NOULD BE CQNDZTIONRL SISN IF REUERSE FFT
CRQSSR = DTRL<HH; X N_I_IIHHJ + DRTRIH<HH_N_
CRgSSI = DTIH(HH) _ N_I_HHH) - DRTRRL(HH)_NZ
DTRL(MR> = DTRL(LL) - CRDSSR
DTIH_HH) = DTIH(LL_ - CRBS_I
DTRL(LL) = DTRL(LL# T CROSSR
DTIH(LL; = bTIH(LL_ ¢ CRDS$I
CONTINUE
Dn IO0 II=I_N
t]ORHRLIZE RND PUT BRCK IN STRUCTURE VRRXRBLES
DZUZDE e'{ _X_LDGN IS R SUBTRRCT £ROH EXPONKN_ s
RUNS HUCH FRSTKR THRN DIUIDE BY N
DRTRRL(IISI_L _ = DTRL<II)/ZXXLD_N
DRTRIH(ZI_Z_L_ = DTZH_IZ) ,' _x_LDGN
C_NTINUE
RETURN
ENDDO/LRTLKU/_ _IUING DRTRRLsDRTRIH
END
Figure A.15 FMP FORTRAN Version of GDSPCI and FFTFOR (Cont'd)
The obvious, and simple, strategy is to have a DOALL on layers and
longitudes, with each instance performing a serial transform.
That is:
      DOALL I=1,144; L=1,NLEV
      ... here the code for a serial fast Fourier transform
      ENDDO
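The strategy of one serial transform per instance can be sketched in Python as follows. This is an illustration under stated assumptions rather than a rendering of FFTFOR: it uses a generic in-place radix-2 decimation-in-time algorithm without the normalization step, the names `fft_serial` and `doall_fft` are hypothetical, and the list comprehension merely stands in for the DOALL (each element is independent of the others).

```python
import cmath


def fft_serial(values):
    """Iterative radix-2 FFT of a sequence whose length is a power of 2."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    a = [complex(v) for v in values]
    # Reorder the input by bit-reversed index.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) butterfly stages.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1.0 + 0.0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= w_step
        size *= 2
    return a


def doall_fft(arrays):
    """One independent serial FFT per instance (stand-in for the DOALL)."""
    return [fft_serial(arr) for arr in arrays]


spectra = doall_fft([[1, 0, 0, 0], [1, 1, 1, 1]])
```

Because each transform touches only its own working array, all intermediate results can stay in processor-local storage, which is the property the text relies on.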
One of the optimizations in the original program needs to be
undone in order to separate the loop into a large DOALL and a
short DO loop. The original version took the multidimensional
arrays that naturally appear in the problem and unwound them into
one-dimensional arrays. Thus, a substantial amount of index
computation was saved by doing the index calculations separately.
In order to make best use of the FMP, the structure inherent in
the problem needs to be retained.
The FMP FORTRAN version shown in Figure A.15 includes the
conversion from space variables to complex function, the forward Fourier
transform (complex) on the complex function, but omits the
conversion from complex function back to real frequency functions,
and also omits the reverse Fourier transform, since both of these
are trivially different from the code that is exhibited.
The arrays DATARL and DATAIM in the original FORTRAN version are
used both to hold the entire input and output files of the
transform, and also to use as working space during the course of
the transformation. In this FMP FORTRAN version, two STRUCTURE
arrays DATARL and DATAIM are used to hold the entire input and
output files before and after the transformations, but two LOCAL
arrays DTRL and DTIM are used as the working space during the
course of the transformation.
Each processor is doing one FFT serially. There are as many FFT's
being executed as there are points around the equator times the
number of levels. The code as exhibited therefore would be
efficient only for grids somewhat finer than the 16 latitudes x 24
longitudes of the MIT code as submitted.
The following list gives, in sequence, the variables that are
candidates for being assigned to registers. This list covers
subroutine FFTFOR only.
INTEGER REGISTERS FLOATING-POINT REGISTERS
Stack pointer
Base address DATARL
Base address DATAIM
Cycle index for IN ALL
J
Base address of common/FTCST/
N
Processor no., for DOALL control
I
L
N/2 (loop limit)
2*J (common subexpression)
2*J-1 (common subexpression)
Base address of common /FFT/
II
LOGN
NUM
NUMHF
NSS
NUMJK
LL
MM
K
MMM
2**LOGN (common subexpression)
TEMPR
TEMPI
W2
CROSSR
CROSSI
In addition to the above list, some scratch and accumulator regis-
ters need to be assigned (some double length). As was observed in
the analysis of the explicit code, more integer registers would be
needed to avoid saving and restoring them. The number of floating
point registers is adequate.
A.7 OTHER ANALYSIS
As an example of additional applications, this section will
discuss two application areas that fall outside of the benchmark
programs. The first section discusses how well the FMP would do
on FFT's when only one FFT is being done instead of 512 FFT's
operating efficiently in parallel as in the spectral weather.
A.7.1 Fast Fourier Transforms on the FMP
A.7.1.1 Discussion
This section makes some preliminary estimates of the through-
put of the FMP executing a single FFT across the entire array. If
data length is assumed to be a power of 2 and at least 512 long,
the resulting throughput is estimated to lie between 0.6 and 1.0
Gflops/sec. The exact throughput figure is dependent on the
algorithm selected for the FFT. This section is a discussion of
the algorithms, and a description of how they operate. An FMP
FORTRAN version of one of the algorithms is presented.
Algorithms which have the final result stored on "scrambled"
indices were developed to allow in-place computation to save
memory. The data interactions in these algorithms correspond to
swapping data between the upper half and the lower half of some
subset of the data, the subset being a power of 2. At the end,
the scrambled data is stored in memory, the indices are
bit-for-bit reversed (so that 0000011 becomes 1100000), and the
reversed indices are then used to reorder the resulting data.
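The bit-for-bit index reversal described above can be sketched in Python. The function names `bit_reverse` and `unscramble` are hypothetical, and the sketch assumes the data length is a power of 2.

```python
def bit_reverse(index, nbits):
    """Reverse the lowest nbits bits of index."""
    reversed_index = 0
    for _ in range(nbits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def unscramble(data):
    """Reorder data stored on bit-reversed indices into natural order."""
    nbits = len(data).bit_length() - 1   # log2 of the length
    return [data[bit_reverse(i, nbits)] for i in range(len(data))]


example = bit_reverse(0b0000011, 7)
reordered = unscramble(list(range(8)))
```

The first call reproduces the example in the text, where 0000011 becomes 1100000.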
Other algorithms, such as Glassman's [4], require that the data
interchange in the body of the algorithm be a perfect shuffle.
There is no rearrangement required at the end.
For a 512-point FFT the computations would be fully parallel
across the processors, and the swaps, shuffles, or rearrangements
would take place on all data. For the 512-point case:
For the "scrambled" algorithms, there are 9 swaps and 1
rearrangement.
For the Glassman algorithm, there are 9 perfect shuffles.
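As an illustration of the data movement involved, a perfect shuffle interleaves the lower and upper halves of the data. The Python sketch below is not taken from the report; the function name `perfect_shuffle` is hypothetical. It also shows a general property of the shuffle: on 2^k items it rotates each index by one bit, so k successive shuffles return every item to its starting position.

```python
def perfect_shuffle(data):
    """Interleave the lower half and the upper half of data."""
    half = len(data) // 2
    shuffled = []
    for low, high in zip(data[:half], data[half:]):
        shuffled.extend((low, high))
    return shuffled


items = list(range(512))
current = items
for _ in range(9):            # 512 = 2**9, so nine shuffles
    current = perfect_shuffle(current)
```

For 512 items, nine shuffles restore the original order, which matches the count of nine shuffles quoted for the 512-point case.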
For FFTs with more points than 512, the amount of data being
swapped doubles, and the number of swaps goes as log2(N), while
the number of multiplications and additions is proportional to N
log2(N). There are exactly (1/2) N log2(N) complex multiplications
in the Glassman algorithm, for taking the Fourier transform of a
real variable (since the odd and even parts of the real function
can be combined into the real and imaginary parts of a complex
function defined over half as many points).
Thus, the time required for each of the following needs to be
considered.
o Swapping N/512 items of data
o PERFECTSHUFFLING N/512 items of data
o REARRANGING N/512 items of data
The times for the above would then be inserted into a formula
where SWAPping and SHUFFLing are multiplied by log2N. As a first
approximation, these times would be added to the time taken for
computation to get the net time for an FFT. The result is that
the "scrambled" versions of the FFT run substantially better on
the FMP, since the SWAP is the SHIFTN operator, while SHUFFLing
and REARRANGING are stores to EM followed by fetches from EM.
A.7.1.2 Timing Estimates
Within the inner loop of an FFT, all processors do the same compu-
tations, and hence will stay in synchronism. Any synchronizations
required do not imply any significant time wasted waiting for the
slowest processor.
A SWAP consists of:
N/512 SHIFTN instructions, at 12 clocks each.
A PERFECTSHUFFLE consists of:
N/512 STOREMs. Each STOREM occurs in its proper place
within its own instance; not as a string of successive
STOREMs. Hence processing can be concurrent with the
write to EM.
NEXTDO (the splitting of a DOALL into two successive
DOALLS) requires the termination of the instances, a
synchronization, and the hidden cycle loop of the subse-
quent DOALL. Hence, the following code is executed in
the processor,
IJUMP % end of instances
WAIT % processor side of the synchronization
IMOVEL % cycle loop variable initialization
IMOVEL % cycle loop limit
ITIX % cycle loop
which has a total of 13 clocks (before correcting for overlap
and instruction fetching).
N/512 LOADEMs. If the STOREMs are to EM modules with a skip
distance of 2, the perfect shuffle has the LOADEMs at a skip
distance of 1, which is one of the "magic" skip distances at
which the CN has no conflicts.
The final formula for timing, in terms of number of clocks,
using TEM to indicate the number of clocks per EM access
(including address computation) is:
Tps = 2(N/512)TEM + 13
A REARRANGING of the data on scrambled indices consists of
N/512 STOREMs, all occurring in succession, followed by the 13
clocks of the NEXTDO, followed by N/512 LOADEMs. The STOREMs
are in succession, so EM module busy will keep them at least 9
clocks apart, but the "EM busy" of the last STOREM can be
hidden behind the 13 clocks implied by the NEXTDO.
In addition, the subscripting on scrambled indices, bit
reversed, is the worst possible permutation for CN conflicts.
It will take 16 times CN access time plus EM cycle (144
clocks) to get all 512 requests through. These additional 144
clocks are approximately the same whether the bit reversed
scrambled indexing occurs on the STOREM or the LOADEM. A
formula is thus
Tr = 2(N/512)(TEM + 144) + 13
The time added to the entire FFT by these operations can now be
computed. Remembering that there are log2N passes through the
inner loop, time for the "scrambled indices" algorithm (after
reducing the formula) is:
T = log2N(N/512)*12 + 2(N/512)(TEM + 144)
For the perfect-shuffle (Glassman) type, time is:
T = log2N(N/512)*TEM + 13 log2N
These are the times spent in data rearrangement. In addition,
there are (N log2N)/1024 complex multiplications and additions per
processor, or 4 N log2N floating point operations per processor.
Simulation shows that real programs that are fairly well adapted
to the FMP run at about 1.3 Gflops (AMATRX, BTRI). This is about
9.8 clocks per floating point operation. If the rest of the FFT
does as well, the time spent in computation is 39.2 log2N(N/512)
clocks.
Table A.8 shows these numbers, together with an estimated throughput
rate (in Gflops) for the FFT assuming that there is no overlap
and that otherwise all the above assumptions hold. An estimate of
35 clocks for TEM was used to cover address computations, CN
delay, and EM access time.
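As a check on the arithmetic, the two data-rearrangement formulas above can be evaluated directly. The sketch below is a plain Python transcription of those formulas with TEM = 35 clocks; the function names are hypothetical, and nothing here models overlap or instruction fetching.

```python
PROCESSORS = 512


def scrambled_clocks(n, tem=35):
    """Clocks spent on SWAPs plus the final REARRANGING (scrambled-index FFT)."""
    passes = n.bit_length() - 1          # log2(N) for N a power of 2
    per_processor = n // PROCESSORS      # items handled by each processor
    swaps = passes * per_processor * 12  # one 12-clock SHIFTN per item per pass
    rearrange = 2 * per_processor * (tem + 144)
    return swaps + rearrange


def glassman_clocks(n, tem=35):
    """Clocks spent on perfect shuffles (Glassman FFT), as given in the text."""
    passes = n.bit_length() - 1
    per_processor = n // PROCESSORS
    return passes * per_processor * tem + 13 * passes


if __name__ == "__main__":
    for n in (512, 1024):
        print(n, scrambled_clocks(n), glassman_clocks(n))
```

For N = 512 and N = 1024 the scrambled-index formula gives 466 and 956 clocks, the values listed in Table A.8.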
A.7.1.3 FMP FORTRAN Version of Glassman's FFT Algorithm
Figure A.16 is an example of a FFT coded for the FMP. The
attached FFT is Glassman's algorithm, and does not scramble the
indices.
The FMP FORTRAN in Figure A.16 is a rather direct translation of
an existing ALGOL program (Figure A.17). In translating from the
ALGOL to the FMP FORTRAN, it is likely that the result is not
optimized for the FMP. Specifically, the perfect shuffle in this
particular code consists of fetching the Z items on shuffled
indices into an INALL array, then the NEXTDO for finishing all
instances for data precedence, and then four successive STOREMs.
Successive STOREMs are not overlapped as they would be if mixed in
with computation.
Table A.8
Summary of FFT Throughput Estimates

Type of FFT    N       Data-shuffling time       Assumed            Total    Gflops
                       (from formulas in text)   computation time   time     (approximate)
"scrambled"    512      466                       257                723     0.474
               1024     956                       514               1470     0.466
               2048    1470                      1028               2498     0.549
               4096    2008                      2056               4064     0.675
"Glassman"     512      431                       257                688     0.498
               1024     830                       514               1344     0.510
               2048    1298                      1028               2326     0.589
               4096    1836                      2056               3892     0.704
This program runs for any binary value of N whatsoever, but is
efficient only for N equal to 512 or greater. The language is
completely independent of the number of processors.
The ALGOL program of Figure A.17 is a free-standing program, which
reads in a data deck and prints out the transform. It was written
for demonstration purposes, to show that the Glassman algorithm
had indeed been understood and programmed. For the FMP, it is
assumed that some main program supplies the data and uses the
results. Thus, all of the I/O and some of the initialization has
been sloughed off onto this assumed main program and does not
appear in subroutine GLASMN.
A.7.2 A Parallel Sort
Sorting is a common computer application. This section
demonstrates an in-core sort that makes use of all processors at a
reasonable processor utilization. Seldom are the items to be
sorted simply numbers to be sorted by magnitude; however, this is
the easiest example to use to show how the algorithm works. The
algorithm starts from a state in which the items to be sorted are
distributed uniformly among the processors. "Processor" could
mean either processor local memory, or a piece of EM address space
allocated to a specific processor. The algorithm will work for
the number of processors (2^n) equal to any power of 2. The
example will be given for a number of processors equal to eight.
The starting condition for the example is given in Figure A.18.A.
The succeeding steps in the algorithm go as follows:
1. Sort the items local to each processor, yielding the
state of Figure A.18.B.
2. Determine the median value globally. One method for
doing this is to guess at a median, and then count how
many items are greater or less than this guess. The
total count is given by means of a SUMALL function on the
individual processor counts. If this guess is not close
enough to the median, one makes a new guess, and finds a
new count. This procedure iterates until a value close
enough to the median is found. Each processor divides
its pile of sorted items into two parts, one larger than
the median, one smaller than the median. This division
is marked in Figure A.18.B.
Swap parts between processors that are 2^m (m = n-1)
apart. The lower numbered processor of each swapping
pair sends the higher of its two parts to the higher
numbered processor, and the higher numbered processor
sends the lower of its two parts to the lower numbered
processor. After the swap the contents of the various
processors are like Figure A.18.C.
_UBROUTIN£ _LR_HN _N,;_tI_Lr_G:_N_
A_UttK DYNAMIC ARRRY DECLRRRTIONS
COItHON /FFTDRT/Z_H_)
LDGiC_L _H
;HTEGER 14_ LUG_N
_ERL PH[_D_E
DDIIRIN / KI'( / l IK._.U ,_N-._.
; NRLL/I_K/H
DO|tfilN /JJ/| JuU,_I,I/_-._
INRLL /J/ R(4,
INTEGER I_ K,_ U_ fi
THE _r_LLD|I_NG FDRH |lR_ CHrl_EN T M TREE RDVRNTR_E DF THE FRDE×
CrlHHRND NH_CH I_ 14UCH F'R_TER THRN FD_ FrIR DIVI_IQH rlF
INTEGER FDNER DF _.,
PHI = 6,_31_5_07_/_kt<LOG_:N
THI_ DD LOOP INITIRLIZE_ THE THIDDLE !_RCTORS
DrtRLL J_U!,N-J.; U-_.ING PHI_I,I_-_I.,I
IF" _J,LT,II+"'"_ THEH
CD_;INES ZN LrtlIER HRLF DF N RRRRY
I.,IKJ) = CO'_:4I'-'HI_(J)
ELSE IF 4_N) THEN
14_.J) :J 3. ZI,I(PHI:k(J-,W/_'.)
EL._E
H.,J) :-: -_ZN.,FHI_:(J-N/_))
EHDIF
ENDDn_ GIUING N
_NIT_RLZZRTIDN, IN NHHT FDLLDN_ D>_ RLi_RY$ £(_URL$ 1,1/;"
O = N/_
DO 100 Ji = I_LDGPN
DDRLL/JJ(J)/_ U_: IN6 Z_,N_D
THE DD||RIN _ DIVIDED INTO _ BLBCI.(_ [_P [3 ELEHENTS ERCH
RNO O RRE B_TH PDHER5 OF
Figure A.16 FMP FORTRAN Version of Glassman's FFT Algorithm
Figure A. 16
L£ = d/O
L = ,J - LZAD
d ZN THI_ LUDP I_ E_URL TO I OF THE SERIRL PRDGRRH -1 RND /2
HEI"|CE RLL U_E ElF I IN THE RL6[_L ;_ REPLRGED BY d HERE
EGURLS (OLD V,'Z)/_ 9 R PERFECT SHUFFLE
l< = HDD(J+LiXD+Z_N) - .L
U E(_URLS _nLD U-.L;/::'
U = K . 0
G = LiAb
B£ = Z_Ug.L/AN(_) - Zi:Ug;-_)'_kN(_3+N/_)
_F. = _(U_.;.)AN,.6+N/E> + Z(U_,,xH(_)
R(I) = Z(E(g£,, "r B1
R(q) = Z(l<,;_) - f_;"
RE_'{NCH HERE T_ USE URLUE_ CDItPUTED TO THI_ PDZNT
IIEXTDD
Z(d_.;.) = R(£)
x<o_:') = R(E)
;Z(d'rN/2_,£) : R(3_;
Z(J'_N/P_ P) = H(q,,
EHDDn /dd/9 _ZVZH_ Z
O = 0/_
RLL INTE6ER DIVIDES RND HULTI_LZES BY PnHERS nF _ RRE SHIFTS
_: = PxS
END DF Ji LnBp
CnNTINUE
RETURN
TRRNSFDRHED DRTR IS LEFT IN /FFTDRT/ CDHtlnH9 RLIR_; THE Z RRRRY
END
FMP FORTRAN Version of Glassman's FFT Algorithm (Cont'd)
BFGIN
INTEGER GAMMA,SPICE,S,V,G.UwD_FI,F_Fb,FX,NUMtJIwM_K|_N,N3 ,
NItJ,LeLRECLpLRFC_NN_F3,SW|,K,LI_I;
PEA L ARrAy ZCO|5|JI,A [O1511),w{Ol_5_IpY{O1255|l
WEA PH C_DEL! RiwH_wTll
FORMAT !_ (XI,_}NPUT ,JS,"R_AL SAMPLEBIPERIOD.,FS._,.UNITS.),
_.(X}&_IN_UT ,JS,_COHPLEX SAMPLES;BERIODW_FBo_wwUNITBN)_
.}tx_w MU_ MANY CO.PONENTSeNe ABE DESIRED AND WHAT SHOULD BE"e
THE INTERVALn/Xb,"BETWEEN SAMPLESp SPACE NU"wjSw
.. . L UDES ,X_, _WFR /X3, OR CPS ,X6, XlOOb ,X_,DB ,XII,"OR CPS",W6,nX$OOO",X_"OB")_
RII(X_,_TH_ INTERVAL I$"_FIO,b_NUNITS")_
FILE CARDCKINO= RE_DER);
FILE LINE(KIN_=PRINTER];
MONITOR LI'_E(O,S);
PROCEDURE G[TDATA;
BEGIN ,
FOR J,=| STEP | UNTIL FX DO
BEGIN
IF((J-FX) LEO O] ?HFN
_RITF(LINE,_FOR K :_ 0 (KI -$)
FOR L 1= 0 STEP I UNTIL
BEGIN
IF(CV NUm*SPICE*N*F$ ] LS$ O) THEN
IF((V-|) _OD (NUM*$PICF) EQL O) THEN
BEGIN
F3I=(F_ MOO N)_ 1;
ZI2*F3-2]I= Z{2*F3-2] + AIL];END
ELSE
LI= LRECLI
END_
REA_CCARD, _13, NI,TI,S_I_N,SPICE,LRECL];IF(G_I EQL O) THEN
WRITE(LINE,RI,N,Tt)
ELSE
WRITE(LINE,_2.N.TI);
WRITE(LINE_R_,N,SPICE)I
PHI I='B.2831853/NI
CI= 20./I.OG(_O.)ILRECI= LRECL I;
DFLF:= I,I(_*Ti*SPICE) ;
F% I= (Nt)DIV (N,SPICE];
IF (8al EQL 0) THEN
hum I= l
ELSE NUM $= 2;
F_ t= (NUM*NI÷L_EE) DIV LRECLI
F_ := (N nlv a) ÷I;
F6 :=(2*N_ 5) OIV O;
HI= N OIV 2;
K1 := LRECL;
GAMMA l= LOG (N) / LOG(?) ÷ ,11
OS=s,.
GETDATA
wRITE(LINE,P8, FOP J s= 0 STEP I UNTIL N3 DO Z{J|)l
FOR j S= 0 _TEP | UNTIL M.I DO
BEGIN
wlJ] :=COS(RMI*J)J
IF(S_I EQL O) THFN
_(J÷_) l= SIN(PHI*J)
Figure A.17 ALGOL Version of Glassman's FFT Algorithm
ELSE
W[J4'H] I= " SIN(PHI*J)/
END;
wRITE(LINE,RS, FOR J /= 0 STEP I UNTIL 2.N-I DO Wtjl);
,oR ' STEP ONTIL oo
FOR LI *'= I STEP I UNTIL S DO
BEGIN
FOR L lu I STEP ! UNTIL O DI1
BEGIN
I*= E*L+2*(LI-1 )*D-l;
K I= 2*((L_tLI-1)*D.2) HOD N)-II
LJI= K+E*D!
I: (LI.'I)*DI
B12= Z [Ull]tW [G]'Z [U] *l_ [G4.M]
B2$=Z [U-1]*N [G4'M| 4'Z[U] _l, [G] ;
A [I'=1] 2=Z[K-t]4"81!
A(I]Ss Z[K|4,BE/
AfI4'N-1]2" Z[K-11"uz;
A[14'NI 2= Z[K|-8_I
ENd7 D,
WRITE(LINF,R8, FOR J 2= 0 STEP 1 UNTIL N3 O0 A[J]);
O 2: O OIV 21
Sl= S.E;
FOR J$= 0 STEP 1 UNTIL N3 DO
BEG_N
E4_J,_,. 2= 'tJ],
ENO/
FOR J 2= 0 STEP I UNTIL N=! DO
BEGIN
Yt((J+H-t) HOD N) 4'1] 2= SgRT(ZCE*j4'I]**2 4' Z[E*J]**2)/(FI*N)I
w[J]|= (J-M-I)*OELFI
EN_RITE[L_NEwR8pFOR"" J!= 0 STEP ! UNTIL N-1 DO Y[J])!
WRITE(LINE-[SPACE u,] )t
WRITE (LINEeR_]I
wRITE(LIN_tSPACE 2])1
FOR J2= F_ STEP 1 UNTIL 00
BEG_Ny * ,,( [2 J 2'] EQL O) THEN
WRITE(LINEeR5eWt2*J..i]oYtEtJ=2],_t;_,J]eY[E_j*.I|eC*LOG(Y[E_j-1]))
ELSE ,,
WRITE (LINEeRb,W [2*J 1] w.Y [2*J.'2] .C_LOG(Y [2"J'2] ),
w tE*J] ,Y tEwJ'I] ,C*LOG(Yt2*J 1] ));
END;
WRITE(LINE[SP_,CE o13!
WRITE (LINEeRI I,OELF) ;
END
Figure A.17 ALGOL Version of Glassman's FFT Algorithm (Cont'd)
Figure A.18 Example of a Sort Algorithm Using 2^N Processors
(diagram of processors 0 through 7 not reproduced; panels as follows)
A. Initial State
B. Sorted and Median Marked
C. After Processor-to-Processor Swap
D. Sorted and Median of Each Half Marked
E. After Processor-to-Processor Swap
F. Sorted and Median of Each Quarter Marked
G. After Processor-to-Processor Swap
3. Sort again within each processor.
4. Divide the range over which the median is to be found in
half, and find separate medians for each half. The state
is now Figure A.18.D.
5. Decrease m by one, that is, divide the swapping distance
in half, and swap again. The result is Figure A.18.E.
6. Repeat steps 3, 4, and 5, finding medians over ranges
which are divided in half each time, and swapping over
distances which are divided in half each time, until the
swapping distance is reduced to one. For the example
with eight processors, step six goes only once, producing
the result shown in Figure A.18.G.
7. Sort again in each processor.
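The steps above can be sketched as a serial Python simulation, with one list per processor. This is a minimal sketch under stated assumptions, not the FMP implementation: the name parallel_sort is hypothetical, the items are assumed distinct, and the median of each range is taken exactly from a pooled sort rather than by the guess-and-count iteration with SUMALL that the text describes.

```python
def parallel_sort(procs):
    """Simulate the median-and-swap sort on 2**n per-processor lists."""
    count = len(procs)
    assert count > 0 and count & (count - 1) == 0, "need 2**n processors"
    procs = [sorted(items) for items in procs]       # step 1: local sort
    distance = count // 2                            # swapping distance 2**m
    while distance >= 1:
        group = 2 * distance                         # range sharing one median
        for base in range(0, count, group):
            pooled = sorted(x for p in procs[base:base + group] for x in p)
            median = pooled[len(pooled) // 2 - 1]    # largest item of lower half
            for low in range(base, base + distance):
                high = low + distance
                both = procs[low] + procs[high]
                # The lower processor keeps the low parts, the higher one the
                # high parts; each then sorts what it now holds.
                procs[low] = sorted(x for x in both if x <= median)
                procs[high] = sorted(x for x in both if x > median)
        distance //= 2
    return procs


if __name__ == "__main__":
    items = [(37 * i) % 64 for i in range(64)]       # 64 distinct keys
    start = [items[8 * i:8 * i + 8] for i in range(8)]
    for number, held in enumerate(parallel_sort(start)):
        print(number, held)
```

With distinct keys and an exact median, every processor again holds the same number of items at the end, as the text states. The intermediate steps are where the load becomes uneven.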
Processor utilization depends on the uniformity with which the
data is distributed among the processors in the intermediate
steps, since the data is equally divided among all 2^n processors
both at the beginning and the end of the algorithm. As an
example, consider the sorting between the states of Figures A.18.C
and A.18.D. Assume that the amount of time taken in a single
processor is proportional to NlogN. Processor No. 4 has 6 items,
and takes a time proportional to 6log6. Processor 0 has 2 items
and takes a time proportional to 2log2. The total time spent
working is proportional to 56.66 while the longest processor time
times 8 is 86.02, giving a processor utilization during this step
of 65.9 percent.
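The utilization figure is the ratio of total work to the work that would be done if every processor were as busy as the busiest one. A minimal sketch follows, assuming a local sort time proportional to N log N with natural logarithms. The function name is hypothetical, and the item counts in the usage example are made up for illustration; they are not the counts of Figure A.18.C.

```python
from math import log


def utilization(counts):
    """Fraction of processor time spent working during one local-sort step."""
    work = [n * log(n) if n > 0 else 0.0 for n in counts]
    return sum(work) / (len(work) * max(work))


if __name__ == "__main__":
    # Hypothetical item counts for eight processors (32 items in all).
    print(round(utilization([2, 4, 4, 4, 6, 4, 4, 4]), 3))
```

With six items in the fullest of eight processors, the denominator is 8 x 6 log 6, or about 86, which is the quantity quoted in the text.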
The actual FMP has 512 processors. Again, in the first and last
step, data is uniformly distributed among the processors. In the
intermediate steps, there will be some spread. If data were
randomly distributed among the processors it would take on approxi-
mately the Poisson distribution, and the amount of data in the
fullest processor could be estimated from that. Given that there
are an average of N items per processor, in the Poisson case the
processor with the most elements would have about N+3N½ elements
for N large enough. For N=10, a table of the Poisson distribution
shows that one processor in 512 is expected to have 19.8 elements,
whereas the approximation for N large gives 19.5.
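Both estimates can be checked with a few lines of Python. This is a sketch only: the function names are hypothetical, and the Poisson tail is summed directly rather than read from a table.

```python
from math import exp, sqrt


def poisson_tail(mean, k):
    """Probability that a Poisson variable with the given mean exceeds k."""
    term = exp(-mean)        # P(X = 0)
    cdf = term
    for j in range(1, k + 1):
        term *= mean / j     # P(X = j) from P(X = j - 1)
        cdf += term
    return 1.0 - cdf


def large_n_estimate(mean):
    """The N + 3 * sqrt(N) approximation for the fullest processor."""
    return mean + 3.0 * sqrt(mean)


if __name__ == "__main__":
    print(round(large_n_estimate(10), 1))
    print(poisson_tail(10, 19), 1 / 512, poisson_tail(10, 20))
```

For a mean of 10, the probability 1/512 falls between the tail probabilities beyond 19 and beyond 20 items. This is consistent with a table value a little under 20.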
Finally, an interesting observation: if the items to be sorted
happen to be in inverse order, it turns out that the distribution
among processors remains uniform through the entire procedure, and
processor utilization is 100 percent.
Appendix A
References
1. C. M. Hung and R. W. MacCormack, "Numerical Solution of Supersonic
Laminar Flow over a Three Dimensional Compression Corner", AIAA
preprint 77-694.
2. "A Documentation of the GISS Nine-level Atmospheric General
Circulation Model", Computer Sciences Corporation.
3. William D. Stanley, Digital Signal Processing, Reston Publ. Co.
4. J. A. Glassman, "A Generalization of the Fast Fourier Transform",
IEEE Transactions on Computers, C-19, #2, Feb. 70.
5. Anonymous, FFT/ALGOL, An Algol program compiled on the BSP
projects B-7700 (Oct. 28, 1975).
APPENDIX B
FMP CONNECTION NETWORK - ANALYSIS AND EVALUATION
B.1 SUMMARY
A connection network (CN) is to stand between the 512 processors
and the 521 EM modules and is to satisfy these requirements. The
connection network would accept requests from the processors,
possibly all 512 simultaneously and establish connections be-
tween the requesting processor and the requested EM module at EM
memory speed. A crossbar switch between processors and memory
modules can provide this function, but at a terrible cost in hardware.
It has N^2 crosspoints where N is the number of ports along
one side.
This appendix describes a connection network based on the Omega
network (described below and in [4]). The Omega network has
O(N log2 N) components, not O(N^2). The particular network which
appears to best satisfy the Connection Network requirements is a
duplex Omega network, providing redundancy for additional reliability,
as well as providing the required function.
The appendix is arranged in the following sequence. First, some
background information and definitions are presented. Second, the
advantages and disadvantages of providing a CN that satisfies the
requirements of being functionally "almost" a crossbar switch are
presented, especially as compared to the original TN (1,2).
Third, various candidate versions of the connection network are
described in detail, including estimates of relative hardware
complexity of each. Fourth, simulation results on these candidates
are presented, obtained by a functional simulator of the CN
and by a second program called the stochastic analyzer. Fifth,
further discussion of the simulation results is used to narrow
down the selection of CN to one or two of the cases simulated and
analyzed. Following that, there is a discussion of other CN-related
topics, including some of the design details that were disclosed
by the simulation results, and finally, a paragraph of conclusions.
The conclusion reached is that sufficient study has been
completed to give confidence in the feasibility of the Connection
Network in the FMP architecture, but that cost/performance tradeoffs
deserve to be further considered.
Discussion of the simulators and analyzers has been relegated to
Appendix H.
B.2 BACKGROUND
The connection network can be visualized as a circuit-switched
dial-up network in which up to 512 callers (the processors) are
placing short calls to the 521 callees (the EM modules). Connections
are to be made in tens of nanoseconds, and held for a few
hundred nanoseconds. Except for the time scale, the action is
like that of the telephone network, and hence, the design of such
a network starts with work done at Bell Labs (3). The work of
Duncan Lawrie (4) is especially applicable.
Many similar networks have been developed, but have been
shown to be topologically equivalent to each other (5,9).* One
name, the "Omega" network, has been chosen as the term to use for
any of this class of networks. The Omega network is shown in a
form called the "baseline" network in (9).
In the FMP architecture being evaluated, each processor computes
its own address in EM. There is no central location where the
switching pattern for the entire network is defined. All patterns
of connection are possible. Since connections must be made in
tens of nanoseconds, there is no time to take a global look at the
entire pattern, and generate a set of control bits for the
network. Hence, control of the various portions of the network
must be local to those portions.
Several different networks have been investigated, and feasi-
bility of the FMP can be achieved with several of them. The
underlying Omega network design, on which the preferred versions
are based, has 1024 ports on the processor side, 1024 ports on the
EM module side, and ten levels of nodes in between. There are
1024 data paths connecting one level to the next. Each level
consists of 512 two-by-two switches, which are described in more
detail below. The connections between nodes exhibit a pattern of
connections designed to permit as many processors as possible to
access EM modules simultaneously in parallel for the patterns of
accessing which occur in the aero flow and weather codes.
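The hardware argument can be made concrete by counting parts for the 1024-port case just described. The sketch below compares a full crossbar with an Omega-style network of two-by-two switches; the function names are hypothetical, and only switch elements are counted (no control logic, buffering, or wiring).

```python
def crossbar_crosspoints(ports):
    """Crosspoints in a full crossbar with `ports` ports on each side."""
    return ports * ports


def omega_switches(ports):
    """Two-by-two switches in an Omega-style network: log2(ports) levels
    of ports/2 switches each (ports is assumed to be a power of 2)."""
    levels = ports.bit_length() - 1
    return levels * (ports // 2)


if __name__ == "__main__":
    print(crossbar_crosspoints(1024))
    print(omega_switches(1024), 4 * omega_switches(1024))
```

For 1024 ports this gives ten levels of 512 switches, 5120 in all. Counting four crosspoints per two-by-two switch gives 20,480 crosspoints, against 1,048,576 for the crossbar.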
The previously described transposition network (1, 2) was
centrally controlled, and required two ten-bit control settings,
one of which was the skip-distance of a p-ordered vector (defined
below). The transposition network consisted of two barrel
switches, one 521 wide, one 520 wide, and some appropriate wiring.
Since a barrel switch that is wider than 512 but not more than
1024 wide, can be built of five levels of one-by-four switches,
the TN also had ten levels of logic. All transfers through the TN
must be synchronized to the control settings, and only those
processors whose requests fit the constant skip-distance, constant
offset description could execute during the duration of a
particular control setting.
* Names include: "baseline network" (5), "binary n-cube",
"butterfly", "flip network", "Omega network", "reverse baseline",
"simplified data manipulator", "hypertorus", and "SW banyan
network".
B.2.1 Definitions
Certain definitions are necessary in order to understand the rest
of this appendix. They are:
B.2.1.1 P-Ordered Vector
A p-ordered vector is a set of EM addresses such that the address
being accessed by the ith processor is in EM module number (d +
p*i) modulo 521 where d is called the "offset" and p is the
"skip distance".
B.2.1.2 P-Q-Ordered Vector
A p-q-ordered vector is a set of EM addresses such that the EM
module number being accessed by the ith processor is (s + p*i + q*
⌊i/k⌋) modulo 521 where k is the "length of each piece", s and p
are "offset" and "skip distance" as above, and q is the "distance
between pieces". The bottom brackets represent the "largest
integer not greater than".
For a system where there are 16 processors and 17 EM modules, an
example of a p-ordered vector would be for the 16 processors to
request access to the following memory modules respectively:
10, 13, 16, 2, 5, 8, 11, 14, 0, 3, 6, 9, 12, 15, 1, 4
where the offset is 10, and the skip distance is 3. For this same
system, with a p-q-ordered vector with p equal to 1 and five elements
per piece, the processors might be requesting from the following
EM modules respectively:
11, 12, 13, 14, 15, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 4
In this case, p is 1, q is 4, and the length of the piece is 5.
Numbers are interpreted modulo 17.
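The two definitions translate directly into code. The sketch below evaluates them for the 16-processor, 17-module system; the function names are hypothetical, processors are numbered from 0, and the offset of 11 used for the p-q case is an assumption read off the first entry of that example.

```python
def p_ordered(processors, modules, d, p):
    """EM module requested by each processor for a p-ordered vector."""
    return [(d + p * i) % modules for i in range(processors)]


def pq_ordered(processors, modules, s, p, q, k):
    """EM module requested by each processor for a p-q-ordered vector."""
    return [(s + p * i + q * (i // k)) % modules for i in range(processors)]


if __name__ == "__main__":
    print(p_ordered(16, 17, d=10, p=3))
    print(pq_ordered(16, 17, s=11, p=1, q=4, k=5))
```

Within a piece, successive processors step through modules p apart. Each new piece starts k*p + q modules after the previous one, which is 9 modules in the p-q case here.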
B.2.1.3 Random Request
A set of EM addresses such that the EM module being accessed by
the ith processor is a random variable, from 0 through 520, which
is independent of the module number being accessed by any other
processor.
B.2.1.4 Blockage
Blockage is the result when two requests try to share the same
path in the connection network, which can then only supply a path
for one of them.
B.2.1.5 Conflict
Conflict is more than one processor accessing the same EM module
simultaneously.
B.2.1.6 Pileup
Pileup is the number of processors having a conflict at a given EM
module.
B.2.1.7 Frame
A frame is a parcel of data of fixed size sent over a transmission
path. In the CN, each frame is 11 bits; five successive frames
make a data word.
B.3 ADVANTAGES
The advantages of the CN over the previously studied Transposition
Network (TN) (1, 2) are simplification of user programming, simplification
of the compiler, improved performance, and broader
spectrum of applications.
Compiler simplification arises because each processor computes EM
addresses independently of the other processors. The compiler
need not be aware of the relationships between those addresses.
No code is emitted to compute offsets, or skip distances, or to
control how many LOADEM instructions are issued. No restrictions
need be imposed on subscript expressions. All of these represent
simplifications of the situation for an FMP using the previously
studied TN, where the compiler would have had to create an
alternate branch with dummy LOADEM instructions to keep synchronization,
even when a given processor will skip all actual computation
in a section of code containing EM accesses. The connection
network (CN) does not require any synchronization, and thus
eliminates all dummy LOADEM instructions.
When the various instances of the DOALL fetch a set of array
elements that do not form a set of linearly spaced elements, no
user precautions and no analysis by the compiler are required.
Examples of nonlinearly spaced elements are the wraparound on
longitude in the GISS weather, and the offsetting of the index J
in subroutine CHRVAL of the 2D MacCormack aero flow code. In the
baseline system these would not have been allowed. The user would
have had to vectorize them. The independent programming of the
processor can make the FMP more than merely a vector machine, so
this restriction, imposed by the TN, represented an incompatibility
with the system objectives.
When the Connection Network is used, system performance would be
affected much less by problem size than when the TN is used. For
example, consider the fetching of data subscripted with the domain
variables in two-dimensional DOALLs. Say the subscripts are I, J
and K. Within the DOALL over I and J, fetches of an array
A(I,J,K) form p-ordered vectors with p equal 1. Within the DOALL
over J and K, fetches of A(I,J,K) form p-ordered vectors with p
equal to IMAX. Within the DOALL over I and K, fetches of A(I,J,K)
form p-q-ordered vectors with p equal 1 and q equal to IMAX*(JMAX
- 1). With the TN, all p-ordered vectors are fetched in one
LOADEM, but a fetch of a p-q-ordered vector required that each
piece of a p-q-ordered vector be fetched with a separate LOADEM
operation, or 512/IMAX fetch operations just to load one datum per
processor. With the new CN, the number of EM cycles required for
a p-q-ordered vector is controlled by the largest pileup, usually
a much smaller number than the number of LOADEM instructions
needed with the TN. The pileup for p-q-ordered vectors is discussed
further in Section B.7.4, together with some simulation
results.
For the specific example that comes from the 3D explicit code
given us by NASA, in the smaller than normal mesh size of 31 x 31
x 31, the improvement is dramatic, from 17 LOADEM instructions
required to fetch A(I,J,K) over I and K, to a maximum pileup of
depth 2. Simulation of this case showed that all processors received
their data within two EM cycles.
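The figures for the 31 x 31 x 31 mesh can be reproduced from the p-q-ordered vector definition. The sketch below rests on several assumptions that are not spelled out in the text: the offset s is taken as 0, processors are numbered from 0, the 512 instances run over I fastest and then K, and the EM module number is the p-q expression modulo 521. The function names are hypothetical, and blockage inside the network is not modelled.

```python
from collections import Counter


def pq_modules(processors, modules, s, p, q, k):
    """EM module requested by each processor for a p-q-ordered vector."""
    return [(s + p * i + q * (i // k)) % modules for i in range(processors)]


def max_pileup(requests):
    """Largest number of processors asking for the same EM module."""
    return max(Counter(requests).values())


IMAX = JMAX = 31
requests = pq_modules(512, 521, s=0, p=1, q=IMAX * (JMAX - 1), k=IMAX)
tn_loadems = -(-512 // IMAX)   # one LOADEM per piece with the TN, rounded up

if __name__ == "__main__":
    print(tn_loadems, max_pileup(requests))
```

Under these assumptions the TN needs 17 LOADEM operations, while no EM module is requested by more than two processors at once.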
The following development shows this advantage of CN over TN, in
analytic form. The slowest processor is the one holding up the
synchronization at the end of a DOALL. If access times were
normally distributed with mean Tav and standard deviation S, then
the worst total of N access times, out of 512 such totals (i.e., the
most-delayed processor) would have a value given in
Equation B.1.
Max Delay = N*Tav + 3*N½*S (B.1)
Because of the central limit theorem, this formula is valid for
large enough N without any need for assuming an underlying normal
distribution. Equation B.2 gives the corresponding formula for
the old TN.
Max Delay = N*Tmax (B.2)
Tmax is the time for however many LOADEM instructions are required
per fetch and may be many times Tav. The reason for the
improvement of equation B.1 over B.2 is that synchronization among
processors for EM accessing is not required with the CN, so that
each processor continues executing without any wait for the
slowest processor.
A possible wait would exist only at the end of a DOALL where data
precedence may force a synchronization. N in equations B.1 and
B.2 is the number of accesses between such waits.
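Equations B.1 and B.2 are simple enough to evaluate side by side. In the sketch below the function names are hypothetical and the numbers in the usage example are invented purely to show the shape of the comparison; they are not measurements or simulation results from this study.

```python
from math import sqrt


def max_delay_cn(n, t_av, s):
    """Equation B.1: worst total of n access times among the processors."""
    return n * t_av + 3.0 * sqrt(n) * s


def max_delay_tn(n, t_max):
    """Equation B.2: every access costs the synchronized worst case."""
    return n * t_max


if __name__ == "__main__":
    # Illustrative values only: 100 accesses between waits, a mean access of
    # 35 clocks with a standard deviation of 10, and a TN fetch of 17 LOADEMs.
    print(max_delay_cn(100, 35, 10))
    print(max_delay_tn(100, 17 * 35))
```

The spread term in B.1 grows only as the square root of N, so for long runs between synchronizations the CN delay approaches N times the average access time.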
A substantial gain in user convenience would be achieved with the
CN. All tricks such as adding dummy instances to the DOALL to
make the domain size equal to the array extents are unnecessary.
There are no "magic" array extents or DOALL domain sizes for
making the third direction have the same speed as the other two
with the CN. Likewise, all need to distort the algorithm to
regularize the subscript expressions would disappear. The code
shown in Figure B.1 is an FMP FORTRAN version of some statements
abstracted from subroutine CHRVAL in a 2D explicit code given to
Burroughs during the previous study. The subscript "J + OFFSET",
being the result of data dependent computations, would have been
disallowed in the originally proposed FMP FORTRAN (1, 2) because
of the restrictions imposed by the TN. Such a subscript is perfectly
proper in the currently described FMP FORTRAN. The many
awkward and arbitrary restrictions on the language, imposed by the
access pattern limitations of the old TN, are not required in a
system using the proposed Connection Network. Any integer expression
can be used as a subscript.
B.4 CN DESCRIPTION
The Connection Network (CN) has two modes of operation. First,
when the processors are independently operating, it would provide
a path from any given processor to the EM module of that processor's
choice, without regard to any other (up to 511) connections
from other processors to other EM modules. Second, certain functions
would be performed in synchronism, because these functions
are much more economical to implement when the processors are
synchronized. This second class of functions would be done under
coordinator command, at a time when all processors are in synchronism.
This second class includes
* Broadcast from coordinator to processors
* "Harvest" data from processors to coordinator
* Broadcast from one EM module to all processors
(FETCHEM)
* Swap data between pairs of processors
Various CN design options are based on either a Benes or Omega
network. The Benes can make any permutation of connections between
processors on one side and EM modules on the other, but only
at the cost of having each connection a function of the connectivity
of all others. Opferman and Tsao-Wu (6) show that the
computation of this "perfect" connection takes on the order of N^2
computational steps (or N log2(N) if a content addressable memory
is used). Thus, for making connections in nanoseconds, it is not
possible to compute the control settings of a Benes network for
each set of new EM addresses. Instead, each node of the network
determines its own setting, based on some very simple computation,
with sufficient redundancy that a path for sufficiently many of
the processors is set up in the desired access time, with random
distribution of the excess time among all processors.
      DOALL, J=1,JM; K=1,KM
      ... statements ...
      IF (condition) GOTO 10
      OFFSET = OFFSET + 1
      ... statements ...
      IF (DYX(J,K) .LT. 0) OFFSET = OFFSET - 1
      IF (J + OFFSET .GT. 1) GOTO 1
      ... statements ...
 10   ... statements ...
      DYX(J + OFFSET,K) = expression
      ENDDO
Figure B.1 FMP FORTRAN of Portions of Subroutine CHRVAL
of 2D Explicit Code
Out of several candidate designs, simulations have been used to
indicate the most efficient, i.e., fastest access time, lowest
occurrence of blocking, and smallest parts count.
Figure B.2 shows the first variation considered. In this case, a
1024-wide Benes network (only some of the edge nodes are shown)
has the first 512 ports on the left attached to the 512
processors, and the first 521 ports on the right attached to the
521 EM modules. Detailed examination shows many of the nodes can
never be used. The middle level must have a full 512 nodes to
switch its 1024 data paths (at two paths per node). In the
remaining nine levels not all the data paths can reach any EM
module, so that only some of the data paths need be implemented.
Parts counts can be derived from Table B.1 which gives the number
of paths required at each level of nodes:
TABLE B.1
Width vs. Layer Number
Layer No.    Number of paths (= 2 x nodes)
10           1024
11           1024
12            768
13            640
14            576
15            544
16            528
17            528
18            524
19            522
On the side with processor ports, the 512 processors can all fit
into a 256 node, 512-wide path, with the result that of the 1024
paths to the middle, half are unused. Figure B.3 shows a smaller
example with 8 processors and 11 EM modules.
Each node is a 2 x 2 crossbar switch and is described in detail in
Section B.4.2. When paths from individual processors to
individual EM modules are set up (the "normal" mode of operation),
each node connects in either one of two ways:
1. Processor-side port A to EM-side port X, also
B to Y (straight-through).
2. Processor-side port A to EM-side port Y, also
B to X (crossed).
As long as the mode of operation is "normal", only one bit of
information is required to determine the setting of a node. When
only one processor-side port has a pending request, that port
provides the bit of control information. When both ports have
pending requests, one port must be chosen to provide the bit of
control information. That port is said to have "priority" over
the other one.
Each node determines its setting from this bit as follows. If the
bit is ONE, the port with the request is connected to the lower
EM-side port; if the bit is ZERO, the input port with the request
is connected through to the upper EM-side port. The control bit
is one of the bits of the port number on the EM side. The middle
level of nodes uses the most significant bit of the output port
number, the two levels on either side of the middle use the next
to most significant bit, and so on to the first and last node
levels, which use the least significant bit.
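The bit-selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the levels are numbered 1 to 19 with level 10 as the middle (as the layer numbers of Table B.1 suggest), and bit 0 denotes the least significant bit of a 10-bit EM-side port number.

```python
def control_bit_index(level, n_bits=10):
    """Bit position (0 = least significant) examined by nodes at a level.

    Assumption: levels are numbered 1 .. 2*n_bits - 1 and the middle
    level is number n_bits. The middle level uses the most significant
    bit; each step away from the middle moves one bit toward the LSB.
    """
    n_levels = 2 * n_bits - 1
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    return (n_bits - 1) - abs(level - n_bits)


def em_side_output(level, em_port, n_bits=10):
    """Return 'lower' when the control bit is ONE, 'upper' when ZERO."""
    bit = (em_port >> control_bit_index(level, n_bits)) & 1
    return "lower" if bit else "upper"
```

For example, `em_side_output(10, 520)` examines bit 9 of port number 520 at the middle level, while `em_side_output(1, 520)` examines bit 0 at the first level.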
B.4.1 Versions of Networks Considered
Several variations on this idea have been devised and simulated.
Figure B.4 is a revised version of Figure B.2, showing the entire
network, but eliminating the detailed depiction of each individual
node. It is, as has been previously noted, isomorphic to a
base-2 Benes network with some of the nodes omitted. Processor
ports, and EM ports, are each packed into the first 512 network
ports on both sides.
Figure B.5 shows the processors spread across every other port at
the left side of a 1024-wide network. The additional nodes
hopefully provide some redundancy in the connectivity.
Figure B.6 shows both processors and EM modules spread across a
1024-wide Benes. To simulate the spreading of EM modules,
transform module number M into a new module number M' as in equation
B.3.
     M' = 2M               0 ≤ M ≤ 511
     M' = 2(M-512)+1       512 ≤ M ≤ 520
These expressions result when M is shifted left end-around one bit
position.
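As a check on equation B.3, the short Python sketch below compares the two-branch formula with a 10-bit end-around left shift. The function names are illustrative only, and the 10-bit width is an assumption taken from the 1024-wide network.

```python
def spread_module(m):
    """Equation B.3: map packed EM module number M to spread number M'."""
    if 0 <= m <= 511:
        return 2 * m
    if 512 <= m <= 520:
        return 2 * (m - 512) + 1
    raise ValueError("EM module number must be in 0..520")


def rotate_left_10(m):
    """Shift a 10-bit number left by one position, end-around."""
    return ((m << 1) & 0x3FF) | (m >> 9)


# The two forms agree for every one of the 521 EM module numbers.
agree = all(spread_module(m) == rotate_left_10(m) for m in range(521))
```

Modules 0 through 511 land on the even ports, and modules 512 through 520 fill odd ports 1 through 17.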
Figure B.7 shows the same number of nodes as in Figure B.6
arranged as two second-halves of the Benes network. Duncan Lawrie
calls such a second half an "Omega network" [4]. The idea is that
if an access is not granted through the upper Omega, the processor
could try a few nanoseconds later through the bottom Omega.
Figure B.4 Full View of Figure B.1, Details Suppressed. Benes Network
Figure B.5 Benes Network with Processor Ports Spread
Figure B.6 Benes Network, Both Edges with Ports Spread
Figure B.7 Double Omega Network, Two Layers, Each the Second Half of a Benes
B.4.2
Figure B.8 shows the basic node of any version. Two bidirectional
12-bit-wide paths connect to each side. On the processor side
they are labelled A and B, and on the EM side X and Y. Internal
connections may be made from A to either X or Y, and from B to
either X or Y. The 12 bits from processor to EM are used for the EM
module number + 1 bit plus "strobe" when transfers are going on.
The twelve bits returning to the processor are 11 bits of data
plus a "latchup" bit. The "latchup" bit is a command to the node
to keep this path connected. The "latchup" bit would be
transmitted from the EM module upon recognizing a valid request coming
from the CN, and serves to keep the path connected as long as
"latchup" is true.
Logic in the CN buffer of the processor uses latchup as the
"acknowledge" that signals that a request has been granted.
Latchup could be dropped by the EM module after the operation
being performed ceases to need the data path. Alternatively,
timing could occur in the CN buffer, and the dropping of strobe
could be the signal to the EM to drop latchup.
The resting state is shown in Figure B.9. "Requests", consisting
of EM module numbers, may or may not be coming out of the
processors, and the connectivity of the node is set up according
to the specified function of module number bit and port bit (A vs.
B). The "latchup" bit coming back from the EM modules is false.
Connectivity is switched as fast as the requests change, since the
initial path connection is pure combinational logic. The command
lines from the CU have a "null" command.
At some time one of the requests finds its way to the correct EM
module, which then emits a "latchup" pulse. Other processors must
not disrupt the chosen path before it is latched. Therefore,
there is a "CN clock" in all processors, with a period longer than
the round-trip time of the CN, so that new requests are emitted
only after old requests have had a chance to latch up. The round
trip delay is about 40 feet of wire, plus 19 gates worth of logic
delay going out and 19 gates worth of logic delay coming back. If
gate delays are 1.5 ns (including some allowance for wiring on
the boards), and wire is 1.6 ns/ft (Teflon or polyester belts),
this delay is 120 ns and sets a lower limit to the CN clock
period. This clock is a second timing signal to each processor
(the first is the main clock), not a countdown of the clock within
the processor. This timing signal selects every Nth pulse of the
main clock.
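The round-trip estimate can be reproduced with a few lines of arithmetic. The Python sketch below uses only the figures quoted in the text; the function and parameter names are illustrative.

```python
def round_trip_ns(wire_feet=40.0, wire_ns_per_foot=1.6,
                  gates_each_way=19, gate_delay_ns=1.5):
    """Wire delay plus logic delay out and back, in nanoseconds."""
    wire_delay = wire_feet * wire_ns_per_foot           # 40 ft of belt
    logic_delay = 2 * gates_each_way * gate_delay_ns    # 19 gates each way
    return wire_delay + logic_delay


delay = round_trip_ns()   # 64 ns of wire plus 57 ns of logic
```

The sum comes to about 121 ns, which the text rounds to 120 ns as the lower limit on the CN clock period.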
Figures B.10, B.11 and B.12 show various latched-up states. Figure
B.11 shows the "straight-through" connection, Figure B.12 shows
the "crossed" connection.
Figure B.8 Basic Node
Figure B.9 Resting State of Node
Figure B.10 Latchup State, One Path, With Data Transferring to EM
Figure B.11 Latchup State, Both Paths, Both Transferring to EM
Figure B.12 Latchup State, Both Paths, Crossed Connectivity
B.4.3 CN Function Controls
The Connection Network serves other interconnection functions in
the proposed system besides the processor-EM paths. Other
functions are controlled from the coordinator. A list of CN
states defined by the coordinator's control is given in the following
paragraphs.
B.4.3.1 "BDCST/HVST"
The "BDCST/HVST" command makes a connection from both A and B to
both X and Y. Data from the CR enters all nodes at Y, and by
fanning out to both A and B, will reach all processors. Data from
the processors enters at both A and B, and is either ORed or ANDed
(it does not matter which) to be combined at the Y-port that the
coordinator listens to. This command is used for FETCHEM as
described below, and for HVST.
B.4.3.2 "Null"
With the coordinator (CR) node command turned off, the node
carries out its wired-in function of passing on requests, and
latching up for the "latchup" signal from the EM module as
previously described.
B.4.3.3 "Wraparound"
Connect port A to port B. This implements the SHIFCN function in
this CN. When the Nth level has a wraparound command, then every
processor is connected to the processor whose CN port number is
different in the Nth bit. N is counted from the left in both
Figure B.4 and Figure B.7. The "wraparound" command is used
for processor-to-processor data swapping. Depending on which
of the ten levels of the CN contains the node getting the "wraparound"
command, data will be swapped between two CN ports which differ
only in one bit of their number. Normally, all nodes of the same
level get the wraparound command, with the result that all CN
ports swap data with those ports that differ by just one bit in
the specified bit position. SUMALL, for example, can be
implemented by swapping data that is just one apart on port number
("wraparound" on the least significant bit) and adding, then by
swapping that sum two apart and adding, then four apart, and so on
up to 256 apart.
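The SUMALL procedure described above is a recursive-doubling exchange. A minimal Python sketch follows, assuming the number of ports is a power of two and modelling each "wraparound" swap as an exchange between ports whose numbers differ in exactly one bit. The function name `sumall` is borrowed from the text, but the code is illustrative and not the system's implementation.

```python
def sumall(values):
    """Leave the sum of all inputs at every port by swap-and-add.

    Step k pairs port p with port p XOR 2**k, i.e. the port whose
    number differs only in bit k, and each port adds its partner's
    running total to its own.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of ports must be a power of two")
    totals = list(values)
    distance = 1
    while distance < n:
        totals = [totals[p] + totals[p ^ distance] for p in range(n)]
        distance <<= 1
    return totals


result = sumall(list(range(512)))
```

For 512 ports this takes nine swap-and-add steps, at distances 1, 2, 4, and so on up to 256.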
B.4.3.4 Diagnostic Commands
As described in Chapter 6, Section 6.1, the individual Omega
networks (layers) of the two-Omega network must be tested
separately for diagnostic purposes. Thus, we need a command to
disable one Omega network while testing the other one. See
Section 6.1 for additional details.
B.4.3.5 FETCHEM
The FETCHEM instruction is implemented in two steps. First, the
EM module number is sent from the coordinator as a normal request
after the processors are synchronized (in order to ensure that
processors are not making requests of their own). This request is
accompanied by a command code to the EM module that causes reading
without sending any latchup.
During the access time of the EM module, the coordinator turns the
entire CN on with the "BDCST/HVST" command. Data from the
selected EM module is therefore broadcast to all processors.
Inactive EM modules emit zeroes to be ORed with the data (or ONEs
to be ANDed with the data, depending on how the nodes are
implemented).
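The reason the OR and AND choices are interchangeable can be seen in a small Python sketch. It is a simplification under stated assumptions: the node-by-node combining through the network is collapsed into a single reduction, the 11-bit frame width is taken from the description of the returning data path, and all names are hypothetical.

```python
from functools import reduce

FRAME_MASK = 0x7FF  # 11 data bits per returning frame (assumed width)


def broadcast(selected, data, n_modules=521, use_or=True):
    """Combine one frame from every EM module as the CN nodes would.

    The selected module emits its data; idle modules emit all ZEROs
    when the nodes OR, or all ONEs when the nodes AND.
    """
    idle = 0 if use_or else FRAME_MASK
    frames = [data if m == selected else idle for m in range(n_modules)]
    if use_or:
        return reduce(lambda a, b: a | b, frames, 0)
    return reduce(lambda a, b: a & b, frames, FRAME_MASK)
```

Either way the combined frame equals the selected module's data, e.g. `broadcast(37, 0x5A5)` and `broadcast(37, 0x5A5, use_or=False)` both return `0x5A5`.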
B.4.3.6 HVST
The HVST instruction is implemented by the coordinator setting the
CN to the "BDCST/HVST" state, at a time when the CN buffers are
"full" and the processors otherwise idle, and then issuing "go",
which is the command being expected by the CN buffer for dumping
the data into the CN. The data arrives at all EM-side ports,
including the port that delivers the data to the coordinator.
HVST is intended to be used primarily for the case that only one
processor is enabled; therefore, it does not matter whether that
data is ORed with zeros, or ANDed with ones, in the CN. The
result is that it shall be left as the logic designer's option whether
the words combined during "BDCST/HVST" are ORed or ANDed.
B.4.3.7 Coordinator Access to EM
CR fetches and stores from EM are no different from processor
LOADEMs and STOREMs. The CR has its own CN buffer.
B.4.3.8 CN to Coordinator Status
Each node emits "busy" = 1 whenever one of its two paths is
latched up. The condition that no node be "busy" is necessary
before the CN can switch to some other command. Now it may be
possible for the CR to tell, from the state of synch of the array,
when the CN is idle, so there may be no need for the "busy" bit.
Until the rest of the design is finalized, however, a "busy" bit
is assumed.
B.4.4 Implementation Details
B.4.4.1 Flip Flops
Two alternative design approaches for the node are:
1. No flipflops, just logic that is latched up by the
"latchup" signal, as described above.
2. Path-holding flipflops that are clocked by the same 120 ns
CN clock that is seen as needed for timing the processor requests.
These flipflops hold the path for just one CN clock period.
Approach 2 uses more gates, but permits faster access to extended
memory. Figure B.13 shows the timing diagram for the two cases.
In case 1, where the node is combinational logic only, the EM
module number contained in the "request" must be held statically
by the processor until the "latchup" signal returns from the EM
module. Then, and only then, would the processor be free to emit
an address toward the EM module over the now latched-up path.
In approach 2, the processor emits the request followed by two
frames of address plus operation code. Each frame is 11 bits. The
node, seeing one, or two, requests on ports A and B, sets the
flipflops with the CN clock, so that the address can continue down
the path, if it is possible to reach the EM module for this node.
The EM module gets its address about 80 ns sooner than it does in
approach 1, cutting 80 ns off the access time. These flipflops
will not stay up without the "latchup" signal coming back before
the next clock; thus, if the EM module is not reached, a new path
is free to be set up on the next CN clock.
B.4.4.2 Wiring
Each node is controlled by one and only one bit of the EM module
number in the request of the port with priority. Since all nodes
are to be physically identical, the control bit must show up at
the same physical location in each node. Thus, previous nodes
must have a wiring pattern for the bits they pass along, such that
they show up in the control bit position after passing through the
correct number of nodes. Figure B.14 shows such a wiring pattern
for a 32-wide CN, such as might be appropriate for 16 processors
and 17 memories. Figure B.15 shows the first few levels of the
512/521 network, showing the connections from X or Y output from
one level to the A or B inputs of the next. Since the interlevel
cables are belts, where wires must lie parallel, and since all
nodes are to be identical, these crossovers occur on the
paddleboards, not in the belts or within the nodes. (Similar
offset-by-one wiring patterns are seen on some of the Illiac IV
paddleboards.)
B.4.4.3 Logic
Each node contains two 12-wide two-way selector gates, one for
each of X and Y outputs, and two 12-wide three-way selector gates,
one for each of A and B. The third input would take care of
wraparound. Each node also contains some decision making logic.
The inputs to the decision making logic are:
Figure B.14 Wiring Crossover Map, 16 Processors To/From 17 EM Modules
Figure B.15 Wiring Crossover Maps, Full Size System
*  1 bit of the EM module number (for every node, this is the
   least significant bit of the frame seen at that node
   because of the wiring patterns of Figure B.12)
   1 bit to control priority
*  1 bit, the 11th bit of the "request" frame, could be used
   for calling for processor-to-processor wraparound
   3 bits, command from coordinator
   1 "latchup" bit
*  1 "strobe" bit, bit 12 of the processor-to-EM path
   (1 CN clock, if design 2, with flipflops, is used)
Asterisked items occur on both ports (either A and B, or X and Y),
leading to a total count of 12 signals that have to be combined in
the combinatorial logic. The logic has the following output
signals: Select A or B or both or no-output for X, select A or B
or both or no-output for Y, select X or Y or B for A, select X or
Y or A for B, "busy". Only 16 logically different output signals
are needed. "No output", or all lines FALSE, is substituted for
an input request that cannot reach its destination.
Feierbach and Stevenson [5] recommend the following algorithm for
determining priority: If the request at A and the request at B
can both be satisfied, then the node is set to either straight-
through or crossed connection, whichever is requested; if the
requests conflict, the node is set to straight-through, which will
be correct for one of the requests. Thus, A has priority if its
request bit is zero; B has priority if its request bit is one.
This algorithm introduces no bias against either A or B, but means
that certain memory modules will be more easily accessible from A,
and other memory modules will be more easily accessible from B.
If memory addressing averages out in some sense, then this
algorithm is unbiased.
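The Feierbach and Stevenson rule can be written out as a small decision function. The Python sketch below is an illustration, not the node's actual logic. It assumes that X is the upper and Y the lower EM-side port, so that a request bit of 0 asks for X and a bit of 1 asks for Y, and that "straight-through" means A to X and B to Y. The function name and return format are hypothetical.

```python
def feierbach_stevenson(bit_a, bit_b):
    """Node setting under the Feierbach-Stevenson priority rule.

    bit_a, bit_b: control bit of the request on port A or B
    (0 = upper port X, 1 = lower port Y), or None if no request.
    Returns (setting, a_granted, b_granted).
    """
    if bit_a is None and bit_b is None:
        return ("straight", False, False)
    if bit_b is None:                       # only A is requesting
        return ("straight" if bit_a == 0 else "crossed", True, False)
    if bit_a is None:                       # only B is requesting
        return ("straight" if bit_b == 1 else "crossed", False, True)
    if bit_a != bit_b:                      # both can be satisfied
        return ("straight" if bit_a == 0 else "crossed", True, True)
    # Conflict: fall back to straight-through, which is correct for
    # A when both bits are 0 and for B when both bits are 1.
    return ("straight", bit_a == 0, bit_b == 1)
```

With both bits 0 the call returns `("straight", True, False)`, and with both bits 1 it returns `("straight", False, True)`, matching the statement that A has priority on a zero bit and B on a one bit.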
The priority rule used successfully in validating the CN goes
as follows for the double Omega. In the upper Omega network, the
upper port of each node has priority; in the lower Omega network,
the lower port of each node has priority. Early simulation
results showed priority could not be left to chance, and all
double Omega simulation results reported in the next Section, B.5,
were done with the priority according to this rule. For the
single Omega network, the priority was alternated each CN clock
period, and there were an odd number of clock periods per EM
cycle. This is slightly more complicated than Feierbach and
Stevenson's rule, but was judged to be less biased.
B.4.4.4 Parts Count
The node's parts count, or at least the gate count, is dominated
by the selection gates: three-input selection gates for processor-
directed signals going back out of ports A and B, and two-input
selection gates for EM-bound signals out of ports X and Y. Using
the Benes network with processor ports packed into the first 512
processor-side ports and with EM ports packed in the first 521 as
an example, the parts count goes as follows. The first nine
levels have 512 paths (256 nodes) each. For the last ten levels,
we can count the number of nodes per level from the data in Table
B.1. The number of nodes is just half of the number of paths passing
through a given level. Adding together the 19 numbers
representing the number of nodes at each level gives a total of 5643
nodes. At each node, there are two ports, and a data path that is
12-wide, with 3-input gates in one direction and 2-input gates in
the other. Computation gives
5643 x 12 x 2 = 135,432 3-way 1-input selection gates, plus
                135,432 2-way 1-input selection gates, for a
                total of 270,864 selection gates
By comparison, the Transposition Network in the preliminary study
[1, 2] consists of two barrel switches, each bidirectional and 9
bits wide in both directions. If these barrels were to be
implemented with 2-way 1-input selection gates, they would have
(520 + 521) x 9 wide x 10 levels x 2 directions = 187,380
2-way 1-input selection gates
Parts counts of the other variations differ accordingly.
If the same level of integration can be achieved in both designs,
then the Benes network of Figure B.2 should take more chips than
the TN by about the ratio of the number of inputs in these
selection gates, or
     (135,336 x 2 + 135,336 x 3) / (187,380 x 2) = 180.6%
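The parts-count arithmetic of this section can be retraced from Table B.1. The Python sketch below does so; the variable and function names are illustrative. One observation is worth stating as such: the 5643-node total is reproduced when each of the first nine levels is taken as 256 nodes (512 paths), while 512 nodes per level gives 7947, the figure listed in Table B.3 for the Benes network with processor ports unpacked.

```python
# Paths per level for layers 10..19, from Table B.1.
PATHS_LAST_TEN = [1024, 1024, 768, 640, 576, 544, 528, 528, 524, 522]


def node_count(nodes_per_early_level):
    """Total nodes: nine early levels plus half the paths of layers 10-19."""
    return 9 * nodes_per_early_level + sum(p // 2 for p in PATHS_LAST_TEN)


packed_nodes = node_count(256)       # processor ports packed
spread_nodes = node_count(512)       # processor ports spread

# 12-wide paths, two ports per node, in each direction.
gates_per_direction = packed_nodes * 12 * 2

# Transposition Network: two barrel switches, 9 wide, 10 levels, 2 directions.
tn_gates = (520 + 521) * 9 * 10 * 2

# Ratio of selection-gate inputs (2-input plus 3-input gates vs. 2-input).
ratio = (gates_per_direction * 2 + gates_per_direction * 3) / (tn_gates * 2)
```

Using the 135,432 gates per direction computed in the text, the ratio comes to about 180.7%; the printed expression uses 135,336 and gives 180.6%. The difference is too small to affect the comparison.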
To such a node count must be added some additional increase
because of the combinatorial logic in each node, some additional
increase because of the additional processor interface required,
and, if the two-layer Omega network of Figure B.7 is used, some
additional increase in the EM module to resolve conflicting
accesses arriving from the two redundant networks. On the other
hand, the network controls, simple enough in the baseline design,
are even simpler here since most control is local to the node.
Some of the increase in size is due to the increase in width, from
9 lines in the baseline design to 12 lines here. This increase is
necessary since the EM module number plus a strobe must be
transmitted in parallel. However, this also would give some speed
advantage, a data word being transmitted in 5 frames instead of 7
bytes.
If only one Omega network, one sheet of Figure B.7 is implemented,
then some means of eliminating the bias against certain processors
must be adopted. Alternating the priority on a regular cycle has
been simulated, and appears to be satisfactory. The suggestions
of Feierbach and Stevenson [5], when adapted to this network, appear
to eliminate bias more economically, but perhaps with side
effects; they have not been simulated.
Another variation on the two-sheet Omega network would be to
provide, at each node, a path for data to go up or down to the
corresponding node of the other sheet. To the node logic of Figure
B.8, a path would be added from port B of the other node, entering
into output gates X and Y, as well as a path from both X and Y of
the other node, entering the outputs at port B. Port B is
selected on the basis that it is the low priority port; the high
priority port will always find a path on its own sheet. With a
two-sheet Omega, one probably uses fixed priorities, favoring the A
port on one sheet, and the B port on the other sheet. However,
the hardware on both sheets can be identical because of symmetry.
This variation increases the number of inputs of selection gates
from ten to 14. Since data paths dominate the hardware, this is
approximately a 40% increase in gate count for the entire CN to
provide these additional paths.
B.5 SIMULATION RESULTS
B.5.1 Summary
Various CN configurations were simulated with the functional
simulator and the simulator results were studied for various
indicators of goodness of function. Test cases consisted of filling a
queue of requests in each processor. In some tests all processors
had requests in a given queue position, so that all 512 processors
made requests. The requests in the 512 processors could form
p-ordered vectors, or could be p-q-ordered, or could be random. The
easiest criterion for performance evaluation is the percentage of
the 512 requests that are granted on the first EM cycle. Sections
B.5.3 and B.7.4 go into more detail on the performance after that
first cycle or when only a portion of the processors are
requesting access.
Table B.2 shows, for a number of possible CN designs, this
performance on the first cycle, and also lists the gate count by number
of nodes and as a percentage of the gate count of the TN [1, 2].
The four variations of networks with the best combination of
non-blocking and parts count were
Case 1. A two layer Omega Network (Figure B.7) with 361% as
much hardware as the Transposition Network.
Case 2. A Benes Network with processors spread (Figure B.5)
with 254% as much hardware as the Transposition Network.
Case 3. One sheet of Omega network with 164% as much hardware
as the TN.
Case 4. A two layer Omega network but with sheet-to-sheet paths
at each node. This is estimated to take about 485% of
the hardware of the baseline TN.
Table B.2 also includes results obtained with the stochastic
analyzer, which computes the probability of blockage within the CN
under the assumption that the input is a random request. Since
the functional simulator could not handle case 4, the stochastic
analyzer results are all that is available for this case. In the
cases where the functional simulator and the stochastic analyzer
were both used, it is seen that the stochastic analyzer agrees
with the simulator results for the case of random requests. For
case 4, there are two outputs from the CN to the same EM module.
The stochastic analyzer does not count the two conflicting
requests arriving at these two ports for the same EM module as a
blockage.
The data of Table B.2 is plotted in Figure B.16, where it shows
the tradeoff between speed and amount of hardware for the CN.
Speed is represented indirectly, as percent success for the case
of all processors requesting simultaneously; hardware is also
represented indirectly, as a gate count, which in actuality can be
only a rough guide for hardware cost. Three of the four cases
previously listed show on this figure as local optima. All
networks investigated have the property that a p-ordered vector with
a skip distance of 1 found all 512 paths simultaneously on the
first request.
B.5.2 Data
The individual simulation averages in Table B.2 are reported in
more detail in Table B.3. The simulator generated either a
p-ordered vector for the 512 requests, or generated 512 random
numbers. Requests could be queued in the processors. In early
runs, Case 1, the two-sheet Omega network, was simulated by
combining two successive cycles of requests of a simulation of case
3, the one-sheet Omega. When the original set of requests is a
permutation containing no duplicate EM module numbers, this is
correct for all cycles. However, when the original set of
requests is random, corrections for multiple accesses to the same
EM module must be made and only the first cycle is correctly
simulated for these early results.
Table B.3 Summary of Individual Simulation Runs

Network              Part      Type of       Offset,   Successes   Percent    Cycles
                     count     request       skip      1st cycle   of 512     for all
                                                                   requests   512

Baseline TN          3120      random                  512         100%       1
                     equiv.
                     (100%)

Benes, proc. ports   5643      p-ordered     (illeg-   268         52.4%      4
packed, EM ports     (180%)                  ible)     199         38.9%      4
unpacked                                               136         27.6%      5
(Fig. B.4)                                             181         35.4%      4
                                                       248         48.5%      4
                                                       average     40.6%
                               random                  175         34.2%
                                                       174         34.0%
                                                       172         33.6%
                                                       170         33.2%
                                                       163         31.8%
                                                       168         32.8%
                                                       173         33.8%
                                                       177         34.6%
                                                       average     33.5%

Benes, proc. ports   7947      p-ordered               325         63.5%
unpacked, EM ports   (254%)                            241         47.0%
packed (Fig. B.5)                                      average     50.2%
                               random                  213         41.6%
                                                       228         44.5%
                                                       average     43.0%

Benes, all           9728      p-ordered     246, 179  215         42%
unpacked             (312%)    random                  176         34.4%      -
(Fig. B.6)

Double Omega,        11264     p-ordered     246, 179  443         86.6%
all ports            (361%)                  17, 17    333         64.9%      3
unpacked                                     0, 228    372         72.7%
(Fig. B.7)                     random                  308         60.2%
                                                       287         59.0%
                                                       average     59.6%
                               p-q-ordered   length of
                               with p=1      piece
                                             31        438         85.6%      2
                                             31        465         90.9%      2
                                             31        443         86.6%      -
                                             100       508         99.3%      2
                                             100       460         89.9%      2
                                                       average     90.5%
The stochastic analyzer data is summarized in Table B.4. The
stochastic analyzer produced a detailed description of
the blockage of the network at each level, giving the probability of
a request being blocked at each level. Only the totals are shown here.
In addition, the number of processors making a request was varied
from 256 (50% of the processors) to 512 (100% of the processors).
The body of the table gives the probability of a request being
blocked within the CN, and hence the fraction of requests that are
expected to be blocked.
In the case of the single Omega, any EM module conflict will
result in blockage within the CN. For the double Omega, up to two
requests for the same EM module may show up at the output port.
The stochastic analyzer does not count this as blockage.
In addition to the actual 512-processor/521-EM module case, the
stochastic analyzer was run for curiosity on a number of other
sizes of arrays, to investigate sensitivity to the exact number of
processors and EM modules. These are also listed.
All the data of Table B.4 is plotted in Figure B.17.
B.5.3 Discussion of Simulation Experiments
P-ordered requests had considerable variation in the percentage of
success, in the functional simulator, as a function of the skip
distance p. This contrasts with the behavior of random requests,
whose behavior was nearly uniform and independent of the seed for
the random number generator. Almost all values of p produce
p-ordered vectors whose percent of requests granted is substantially
better than for random requests.
Certain skip distances (including p = 1) are "magic", in that all
EM accesses are attained on the first cycle, with no interference.
Figures B.18 and B.19 show the distribution of percentage success
over the various skip distances tried for two of the networks.
The experiment has a defect; skip distances were not selected at
random, but were partly picked on hunches that said they would be
"magic", with high success rate, or "perverse", with low success
rate. p = 17 was expected to be perverse and it was. p = 228 was
expected to be "magic" (228 is the reciprocal of 16 in modulo 521
arithmetic) but it was not.
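The modular arithmetic behind the choice p = 228 is easy to check. The Python sketch below also lists the EM modules touched by a p-ordered vector, under the assumption (not stated in this section) that element i of the vector addresses module (offset + p*i) mod 521. The function names are illustrative, and the three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
N_MODULES = 521  # number of EM modules, a prime


def modular_inverse(a, modulus=N_MODULES):
    """Reciprocal of a in modular arithmetic (Python 3.8+)."""
    return pow(a, -1, modulus)


def p_ordered_modules(offset, skip, count=512, modulus=N_MODULES):
    """EM module numbers of a p-ordered vector (assumed mapping)."""
    return [(offset + skip * i) % modulus for i in range(count)]


inverse_of_16 = modular_inverse(16)
distinct_for_228 = len(set(p_ordered_modules(0, 228)))
distinct_for_17 = len(set(p_ordered_modules(17, 17)))
```

Because 521 is prime, any skip that is not a multiple of 521 touches 512 distinct modules under this mapping, so the differences in success rate between skip distances would have to come from blocking inside the network rather than from duplicate module numbers.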
In normal operation, processors are not spending full time
accessing EM, but are spending most of their time doing other things.
Furthermore, since they are processing independently of each
other, processor requests will often get out of synch with similar
requests in other processors. Therefore, a question of interest
is to what degree the blocking in the CN becomes less as the
percentage of requests is less. Two methods were used to
investigate this question. First, the stochastic analyzer could be run
with the probability of a request being issued from a given
processor set to various values. These results are shown in
Figure B.20. These results are for single Omega (case 3), and
double Omega with interlayer data paths (case 4), plus some
non-realistic cases. The second result was by simulation with
only some portion of the processors having CN requests. Most of
the results of the second method were obtained with an early
version of the simulator which could not be initialized to fewer than
512 requests. However, the simulator did keep retrying all
requests that failed to be satisfied on the first EM cycle. Hence,
these leftover requests can be used to estimate the response of
the CN to the situation that only a portion of the processors are
making a request. Figures B.20, B.21 and B.22 are these data for
the double Omega (case 1), the Benes (case 2), and the single
Omega networks (case 3) respectively. Data points derived from
requests left over from p-ordered vectors are marked with X; points
representing leftover requests from originally random requests are
marked with dots; and points marked with circles are cases that
were run with partial random requests after the functional
simulator was improved so as to initialize the processor request
queues for an arbitrary number of processors less than 512.

Figure B.18 Distribution of Percent Success for p-ordered Vectors, Benes Network
Figure B.19 Distribution of Percent Success for p-ordered Vectors, Omega Network
B.5.4 Test Cases Abstracted From the Aero Flow Codes
In two directions of accessing, the aero flow codes produce
p-ordered vectors as access requests. In the third ("hard")
direction, a p-q-ordered vector is produced. A full-scale implicit
code might have dimensionality (100, 50, 200), leading to p=1 and
q=4900=211 modulo 521. The explicit code as supplied (a small-mesh
test case) has dimensionality (31, 31, 31), leading to p=1 and
q=931=410 modulo 521. Several test cases were run using the
double Omega, case 1, which by that time was targeted as the CN
most likely to be recommended, and the results are shown in Table
B.3, where they are called "p-q-ordered with p=1". The first
sheet of simulator printout, on the first CN clock, gives a
printout for the first layer which is identical with the first
clock of a single Omega; hence, these data are also listed in Table
B.3.
B.6 SELECTION AMONG THE CN ALTERNATIVE APPROACHES
Four preferred approaches to Connection Network (CN) implementation
were listed in section B.5.1. The arguments presented
below show that the double Omega network (case 1 or case 4) is
preferred. Trade-off studies between these two cases are incomplete.
Table B.5 compares the characteristics of the four preferred
cases.
In Table B.5, the "Hardware" column compares the gate count of
data carrying gates of the various versions with the corresponding
gate count of the Transposition Network considered during the
Preliminary Study [1, 2]. This comparison is used since package
count is subject to uncertainties of packaging, suitable part
availability, etc.
"Random Success" is, for sets of 512 simultaneous random EM access
requests, the average percentage that were serviced on the first
EM cycle.
"P-ordered Success" is the corresponding average percentage for
sets of 512 p-ordered requests. Substantial variation in this
percentage from one set to another was observed, although performance
was consistently better than it was for random requests.
The percentage given does not include p=1, the simple vectors, for
which the success percentage is always 100%, as the next line
reminds us.
"P-q-ordered success" gives the percent success observed with
so-called "P-q-ordered" vectors, in which the module numbers come
from the set M_i = (i*p + (i DIV k)*q) mod 521. The value of p was
always 1 in the test cases, which come from actual aero-flow
codes.
B.6.1 Discussion of Results
The data in Table B.3 come from a simulator which makes 512
simultaneous requests of the EM modules. In actual programs, this
is expected to happen only on the first cycle of the DOALL on the
first EM access. Once some processor has been delayed on
accessing EM memory, it will no longer be in synch with the access
requests of other processors, and so the system should be self-regulating
for all but the shortest DOALLs, with an effective
delay controlled by the average access time observed when some
fraction of all the processors are requesting access to memory.
Consider a program that averages five floating point operations
per EM access (for example, the 2D version of the explicit code,
according to the Preliminary Study [1, 2]). Each EM access ties
up the CN for an estimated four CN clocks of 120 ns each, if the
success rate were 100%. Five floating point operations in 512
processors will take at least 1471 ns, so that the CN would have
162 requests pending on the average at any given time. Figure
B.20 shows that the percent success with case 1, the double Omega,
is nearly 100% at this level of loading. Figures B.21 and B.22
show about 80% success at 210 requests loading for case 2, the
Benes network, and 60% success at 270 requests for case 3, the
single Omega network. The 162 requests are 42.4% of maximum
loading for case 1; 210 requests are 65% of the maximum loading
for case 2; and 270 requests are 75% of the maximum loading
for case 3.
None of the cases carries a hardware cost greater than about one
third that of the set of 516 processors. The paragraph above
shows that the simple double Omega (case 1) can handle programs
with as few as five floating point operations per EM access and
still have margin to accommodate bursts of EM access. Such bursts
are planned; we expect a flurry of fetching from EM at the
beginning of many DOALLs and another, shorter burst of stores to EM
at the end of many DOALLs.
The double Omega network with interlayer paths (case 4) is even
better than the case 1 double Omega at not blocking. Unfortunately,
modification of the CN simulator to include case 4 would
have been a major effort, and was not done in time for this report.
Hence the evaluation of this network is incomplete.
The choice between case 1 and case 4, both double Omega networks,
must take into account a number of other factors if the choice is
to be optimized. Among these are:
* Characteristics of the applications programs. The four
benchmark programs represent only four points in applications
space. The main characteristic of interest here is the number of
EM accesses, and their distribution in time.
* Relative cost of the two versions. The gate count is in
the ratio of 1.4:1, case 4 with interlayer connections having the
more gates. If the CN chips turn out to be strictly pin-limited,
the extra gates may not cost much at all.
* Ease of diagnosing hard failures. In the simple two-layer
network, diagnostics are straightforward, since each single Omega
network, tested separately, is easy to diagnose, as shown in
Chapter 6 of this report. More complex hardware controls are
needed to make the more complex version as easy to diagnose.
B.7 ADDITIONAL CONSIDERATIONS
The remainder of this appendix considers an assortment of
behaviors of the CN and aspects of EM accessing. These include
references to or discussions of
* Modular partitioning
* Mapping of module number to CN port number and sparing
* Processor-to-Processor transfers
* EM module conflicts for p-q-ordered vectors
* Approximate validity of the assumption of random EM
module numbers when EM accesses are queued within the
processors.
Appendix H contains brief discussions of the CN simulator and of
the stochastic analyzer, respectively. Listings of the CN
simulator (prior to the insertion of the capability of testing
p-q-ordered requests) and of the stochastic analyzer have been
provided to NASA Ames.
Appendix I contains an analysis of the connectivity of various
networks which was performed soon after CN considerations began.
B.7.1 Modular Partitioning
Note the division of an Omega network (Figure B.23 is a 16 x 16
Omega network) into distinct upper and lower halves after the
first level of nodes, and into quarters after the second. It is
expected that after the second level of nodes, identical quarters
can be put into each of the four EM cabinets. Thus, the CN would
not physically exist as a single central item except possibly for
the first two levels of nodes.
B.7.2 Mapping
Mapping is described in adequate detail in Chapter 5, and need not
take much space here. The probable mappings are as follows:
The CN ports on the processor side are numbered 0 to 1023. The
first seven bits plus the least significant bit will be called
CN-port-within-cabinet. The two intervening bits are the cabinet
number. Within the cabinet, the processors are numbered 0 through
128, including the spare. Processors 0 through 127 are assigned
port numbers as follows: reverse the processor number end for end,
least significant bit to most significant bit position and vice
versa, then multiply this result by 2. The result at this point is
CN-port-number-within-cabinet. Processor 128 is assigned to
CN-port-number-within-cabinet No. 1. All others have even-numbered
ports.
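The processor-side rule can be sketched in a few lines of Python. This is only an illustration of the mapping as described above; the function name cn_port_within_cabinet is hypothetical, and treating the processor number as exactly seven bits wide is an assumption drawn from the 0 through 127 range.

```python
def cn_port_within_cabinet(processor: int) -> int:
    """Map a processor number (0..128) to its CN port within the cabinet."""
    if not 0 <= processor <= 128:
        raise ValueError("processor number must be in 0..128")
    if processor == 128:
        # The 129th processor takes the single odd port, No. 1.
        return 1
    # Reverse the 7-bit processor number end for end (assumed width: 7 bits).
    reversed_bits = int(format(processor, "07b")[::-1], 2)
    # Multiplying by 2 leaves the least significant bit clear (even ports).
    return 2 * reversed_bits


ports = [cn_port_within_cabinet(p) for p in range(129)]
```

Under this reading, processor 1 lands on port 128 and processor 127 on port 254, and all 129 processors receive distinct ports.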
The CN ports on the EM module side are numbered from 0 to 1023.
There are 525 EM module slots, and hence, 525 CN port numbers to
be assigned. EM module numbers 0 through 511 are assigned to the
even port numbers from 0 through 1022 respectively. The
additional 16 slot numbers are assigned four per cabinet as shown
in Equation B.6.
CN Port No. = 32 x (EM No. modulo 512) + 1 for EM No. > 511. (B.6)
On the EM side, the most significant two bits of port number are
the cabinet number.
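A minimal sketch of the EM-side assignment follows, taking Equation B.6 exactly as printed. The names em_port and em_cabinet are hypothetical, and the 10-bit port width is inferred from the 0 to 1023 numbering.

```python
def em_port(em_no: int) -> int:
    """CN port number for an EM module number, per the text and Equation B.6."""
    if em_no < 0:
        raise ValueError("EM module number must be non-negative")
    if em_no <= 511:
        # Modules 0..511 take the even ports 0..1022 in order.
        return 2 * em_no
    # Equation B.6 as printed, for module numbers above 511.
    return 32 * (em_no % 512) + 1


def em_cabinet(port: int) -> int:
    """Cabinet number: the two most significant bits of a 10-bit port number."""
    return (port >> 8) & 0b11
```

For instance, em_port(511) is 1022, which falls in cabinet 3, and em_port(512) is the odd port 1.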
Figure B.23 16-Wide Omega Network
Sparing of EM modules would be accomplished by replacing a
reference to a failed EM module with a reference to one of the
spares (numbered 521 through 524). The remapping of such a
reference would occur in the CN buffer. The remapping carried out
in the CN buffer would change up to four EM module numbers from
their normal CN destination port number assignments to the CN
destination port numbers for the spares. Sparing of processors is
done by designating one as spare, whereupon all processors whose
physical location numbers are higher than the physical location of
the spare within the same cabinet interpret physical location
minus one as their processor number.
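The processor renumbering rule is simple enough to state as a short Python sketch. The function name processor_number and the use of None for the spare itself are illustrative choices, not part of the design as described.

```python
from typing import Optional


def processor_number(location: int, spare_location: int) -> Optional[int]:
    """Processor number seen at a physical location (0..128) within a cabinet."""
    if location == spare_location:
        return None  # the spare carries no processor number (illustrative choice)
    if location > spare_location:
        # Locations above the spare shift down by one.
        return location - 1
    return location


# With the spare in location 128, numbers equal physical locations.
default_numbers = [processor_number(loc, 128) for loc in range(128)]
```

If location 5 were designated the spare instead, the processor in location 6 would take number 5 and the one in location 128 would take number 127.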
B.7.3 Processor to Processor Transfers
SHIFCN using "wraparound" in the simplest way is effective only
when the processor in physical location 128 in each cabinet is
designated as the spare. The "wraparound" command, as described, makes
a connection between two CN ports whose numbers differ only in one
bit position. Although the positions of the bits in a CN port
number are different from the positions of the bits in a processor
physical location, they are the same bits, rearranged (swapped end
for end and shifted by one). Thus, to get bits of processor
number to correspond to bits of CN port number, we must have the
processors in the first 128 out of the 129 physical slots.
Thus, some modification to the simple "wraparound" described in
the previous sections is called for in order to accommodate both
sparing and the SHIFCN instruction. The SHIFCN instruction is not
used anywhere in the aero flow or weather codes except as part of
the SUMALL function. In SUMALL, since the use of SHIFCN is hidden
inside system software, deficiencies of SHIFCN could be avoided by
programming. However, the SWAP function will require either a
solution to the SHIFCN problem exposed above, or else a store to
EM followed by a fetch from recalculated addresses.
B.7.4 EM Module Conflicts on p-q-ordered Vectors
Failure to access all 512 memory words in parallel can be due just
as much to request conflicts, where several processors are trying
to access one memory module, as to CN blockage. Case 3, the
single Omega, has the property that all EM module conflicts are
eliminated by a CN blockage that occurs somewhere within the CN.
These blockages that resolve conflicts should not be blamed on a
CN inadequacy, since even a perfect CN will not eliminate the original
conflicts.
Depth of conflict, or "pileup", is defined as the number of processors
requesting the same EM module on one CN cycle. Pileup is not
to be confused with the queues of requests within the processor,
which could conceivably contain even more requests for the same EM
module, but these would not come to light until some later CN
clock.
P-q-ordered vectors occur frequently in the aero flow codes (in
the "hard" direction). When arrays are placed in Extended Memory
with successive elements in adjacent EM modules and when the processors
are each accessing an element of a p-q-ordered vector concurrent
with all the other processors, then EM conflicts of the
sort just described can occur. This situation is discussed below.
Table B.6
Worst "p-q-ordered" cases
Array Dimensions  Pileup    Array Dimensions  Conflict Depth
20 x 26           20        29 x 18           18
26 x 40           13        29 x 36           14
39 x 40           13        34 x 46           16
42 x 62           13        41 x 89           13
50 x 73           11        45 x 81           12
43 x 97           12        49 x 85           11
34 x 23
Since any array size declaration picked at random is not likely to
be one of the bad cases, and since the bad cases are all smaller
than the problem sizes for which the FMP is targeted, the problem
would appear to be a minor one, of the sort most conveniently handled
by having the compiler issue a warning to the programmer when
one of the bad cases is seen. The depth of conflict can never be
more than the number of p-ordered pieces in a p-q-ordered vector,
since the p-ordered pieces never have conflicting access
internally.
In Table B.6, the number of conflicts may be different depending
on the order of M, N. Usually, array dimensions (M, N, X) where M
is less than N have more conflicts than (N, M, X). In the table,
the worse of the two cases is listed.
Figure B.24 shows an example of the pileups that occur when an
adverse p-q-ordered vector is accessed from a smaller number of EM
modules. In this example, the number of modules and the number of
processors are both 11. The vector being fetched is M_i = (3 + 1*i
+ 9*(i DIV 3)) modulo 11 for 0 <= i <= 10. The top portion of the
figure shows the address space in these 11 modules, plotted within
the two-dimensional representation based on module number vs.
address within module. The addresses being accessed are marked
with an asterisk. The lower part of Figure B.24 shows the resulting
pileups. In this case, the worst pileups are of depth three
at module numbers five and six.
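The 11-module example can be checked directly. The Python sketch below evaluates the vector given in the text and counts how many requests land on each module; the helper name pq_modules and its start parameter (the constant 3 in the example) are introduced here only for illustration.

```python
from collections import Counter


def pq_modules(count, start, p, q, k, modulus):
    """EM module numbers for a p-q-ordered vector with pieces of length k."""
    return [(start + p * i + q * (i // k)) % modulus for i in range(count)]


# The example vector: M_i = (3 + 1*i + 9*(i DIV 3)) modulo 11, 0 <= i <= 10.
modules = pq_modules(count=11, start=3, p=1, q=9, k=3, modulus=11)

# Pileup = number of processors requesting the same module on one cycle.
pileup = Counter(modules)
worst = max(pileup.values())
worst_modules = sorted(m for m, depth in pileup.items() if depth == worst)
```

Here worst comes out as 3 and worst_modules as [5, 6], in agreement with the depth-three pileups at modules five and six.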
Figure B.24 Example of Pileup
(Upper panel: ADDRESS WITHIN MODULE versus MODULE NUMBER, 0 to 10; lower panel: PILEUP versus MODULE NUMBER.)
If q plus length of piece is nearly equal to 521, then the
successive pieces of vector will tend to coincide in the same EM
modules, generating substantial conflicts. When an array has a
dimensionality (M, N, X), and the DOALL is on the first and third
subscript, the result is a p-q-ordered accessing of that array
with p=1 and q+K=M(N-1). All numbers that are close to multiples
of 521 which can be factored into an M and an N that are within a
factor of two of each other were surveyed. The depth of conflict
in the most-accessed EM module for each of these cases was
computed. Pileups can also occur when MN is close to 260 modulo
521. Out of all possible pairs of numbers M, N that lie within
the above range, exactly fifteen pairs generate EM module
conflicts that are 10 deep or more (listed in Table B.6). The
worst case is M, N = 20, 26, which yields a depth of conflict of 20
in six memory modules, and which takes 26 cycles of accessing to
resolve, as shown by simulation.
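The worst case can be reproduced with a short Python sketch. It rests on stated assumptions rather than on the original survey program: 512 processors, 521 modules, piece length k = M, p = 1, and q = M(N-1), which is the reading that matches q = 4900 for the (100, 50, 200) case in B.5.4. The function name max_pileup is hypothetical.

```python
from collections import Counter


def max_pileup(m, n, processors=512, modulus=521):
    """Deepest EM module conflict for an (M, N, X) array accessed p-q-ordered.

    Assumes piece length k = M, p = 1 and q = M*(N-1).
    Returns (depth, number of modules that reach that depth).
    """
    k = m
    q = m * (n - 1)
    counts = Counter((i + q * (i // k)) % modulus for i in range(processors))
    depth = max(counts.values())
    return depth, sum(1 for d in counts.values() if d == depth)


depth, modules_at_depth = max_pileup(20, 26)
```

Under these assumptions the 20 x 26 case gives a depth of 20 in six modules, and max_pileup(29, 18) gives a depth of 18, both in line with Table B.6. The sketch does not model the number of cycles needed to resolve the conflict.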
B.7.5 Non-Randomness
Given a random set of EM module numbers as a request, there will
be conflicts at some of the memory modules. After the first cycle
of satisfying the requests has occurred, the memory module with an
N-way conflict will still have an N-way or (N-1)-way conflict.
Hence, if there is a succession of random requests for memory in
the processors, the leftover requests will tend to bunch up to
some configuration that is worse than a random request. In order
to test this effect, a test case was run with all 512 processors
each having a queue of three random requests. The case 1 double
Omega network was used for simulation purposes.
Tables B.7 and B.8 trace the history of this test through the 12
EM cycles that it took to satisfy all processors. For each cycle
Table B.8 gives the number of processors requesting memory, the
number of memory modules over which such a request is expected to
fall (by ref. 1), the smaller number of memory modules that the
bunched-up requests actually asked for, the number
of memory modules actually reached (any difference between the
second and third is due to conflicts in the CN), the percentage of
non-blocking in the CN, and finally, the length of the longest
pileup observed. Table B.7 gives the history of memory module
conflicts per cycle for this test. Cycle 11 included one processor
that was requesting the second item in the processor's queue
of three items. This lone processor's third item constitutes
cycle twelve.
TABLE B.7 Pileup History
Number of EM Modules with Specific Conflict Depths
Cycle  Depth  Depth  Depth  Depth  Depth  Depth  Depth  Depth
         1      2      3      4      5      6      7      8
  1     197     90     31      4      4      1      0      0
  2     182     89     26     13      3      0      1      0
  3     199     76     30      8      3      4      0      0
  4     165     59     19      6      4      0      1      0
  5     137     36     13      7      0      1      0      0
  6      99     30     10      1      1      0      0      0
  7      74     17      4      0      1      0      0      0
  8      52      6      0      1      0      0      0      0
  9      20      2      1      0      0      0      0      0
 10       9      1      0      0      0      0      0      0
 11       3      0      0      0      0      0      0      0
 12       1      0      0      0      0      0      0      0
From Table B.8 one can see that on Cycle 1 there were 512 requests,
and that, if purely random, one should expect these
requests to involve 327.2 EM modules on the average. There were
327 memory modules in this first cycle, whose requests come directly
from the random number generator. In subsequent cycles, there
are always slightly fewer EM modules being requested than one
should expect if the number of processors requesting were issuing
random requests. At cycle 5, there are only 194 different memory
modules in the 282 requests being issued by 282 processors, whereas
if those 282 requests were random, one expects 217.9 different
memory modules to be named. This is the worst bunching of requests
seen in the whole run.
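For orientation, the expected number of distinct modules named by n independent, uniformly random requests over m modules can be estimated with the standard occupancy expression m(1 - (1 - 1/m)^n). Whether ref. 1 uses exactly this expression is an assumption, as is the choice of m = 521; the function name below is hypothetical.

```python
def expected_distinct_modules(requests: int, modules: int = 521) -> float:
    """Expected count of distinct modules hit by uniformly random requests."""
    # Each module is missed by one request with probability (1 - 1/modules).
    miss_probability = (1.0 - 1.0 / modules) ** requests
    return modules * (1.0 - miss_probability)


estimate_282 = expected_distinct_modules(282)
estimate_512 = expected_distinct_modules(512)
```

With these assumptions the estimate for 282 requests is about 217.9, matching the figure quoted for cycle 5, and the estimate for 512 requests is about 326, within roughly one module of the 327.2 quoted for cycle 1.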
Whether these results are statistically significant was not analyzed;
they might be within the normal range of random variations.
Whether significant or not, the indication is that the expected
bunching effect is fairly small.
B.7.6 Redundancy
The double Omega, case 1, network has the property that either half
can be disconnected from the system under coordinator control.
This feature is provided to increase system availability, since
the double Omega with one of its networks turned off is precisely
the single Omega, case 3, and will support FMP program execution,
but at some increase in effective EM access time.
B.8 CONCLUSION
The study has shown that the double Omega Network (case 1 of this
discussion) can be expected to give the required performance at
reasonable cost. Its performance has been validated by simulation
and analysis. Various options giving either higher performance or
lower cost have also been presented. Additional options were
considered during the course of this study, but were omitted from
this discussion in order to avoid digressions.
Although sufficient study has been completed to give confidence in
the feasibility of the Connection Network in the FMP architecture,
cost/performance trade-offs deserve further consideration.
REFERENCES
1. Final Report, NASF Preliminary Study, contract NAS2-9456,
Burroughs, Oct. '77.
2. Final Report, NASF Preliminary Study extension, contract
NAS2-9456, Burroughs, Feb. '78.
3. V. E. Benes, "Optimal Rearrangeable Multi-stage Connecting
Networks, Part 2," B.S.T.J. 43 (1964) 1641-1656.
4. D. H. Lawrie, "Access and Alignment of Data in an Array
Processor", IEEE Transactions on Computers, C-24 (1975)
1145-1155.
5. G. Feierbach and D. Stevenson, "A Feasibility Study of Programmable
Switching Networks for Data Routing", IAC Phoenix
project memorandum 003, May '77.
6. D. C. Opferman and N. T. Tsao-Wu, "On a Class of Rearrangeable
Switching Networks", BSTJ, May-June 1971, Vol. 50, #1.
7. P. J. Willis, "Derivation and Comparison of Multiprocessor
Contention Schemes", Computer and Digital Techniques, Vol. 1,
No. 3, Aug. 1, 1978.
8. J. Lenfant, "Fast Random and Sequential Access to Dynamic
Memories of any Size", IEEE Transactions on Computers, Sept.
'77, Vol. C-26, No. 9.
9. C. Wu and T. Feng, "Routing Techniques for a Class of Multi-stages
Interconnection Networks", Proc. of the 1978 International
Conf. on Parallel Processing, Wayne State U., pub. by
IEEE Computer Society, 1978.
APPENDIX C
INSTRUCTION SET AND TIMING INFORMATION
C.1 INTRODUCTION
The instruction set has undergone substantial refinement since the
instruction set of the Preliminary Study [1,2]. Additional functions
have been identified, including the necessity for hardware
double precision, a "read with lock" operator in Extended Memory,
additional operators for the system software, and so on. The
unsynchronized CN has required substantial changes in the
operators that access Extended Memory, including the addition of a
MOD 521 operator in every processor, and the elimination of the CN
controls from the coordinator for EM accessing.
One set of processor instructions is known to be necessary, namely
a set of operators to allow formatting of output, and unformatting
of input. These have yet to be specified. Insofar as the
instruction set presented here still does not have them, it is
incomplete. In evaluating the processor against the benchmark
aero flow codes and weather codes, these character-manipulating
operators are not needed, even though they will be needed in a
final design.
Table C.1 is a listing of the instructions. It is divided into
three sections. Processor instructions are in the first.
Commands issued by the coordinator and effected in the processor
are the second. Coordinator instructions are the third. Since
every processor is a serial scalar processor, and can execute
scalar code on data residing in EM, no separate 513th "scalar"
processor is any longer required, nor are the "scalar unit" instructions
of the baseline system any longer included. Hence, no
floating point instructions are listed for the CR. If floating
point requirements become identified in the system software that
executes on the CR, there will have to be floating point capability
included in the CR.
Table C.2 at the end of this appendix is a list of the timing of
the instructions. The format used is similar to that used in the
Preliminary Study [2], except that the instruction descriptions
have been moved to Table C.1.
C.2 DESCRIPTION OF TABLE C.1
Table C.1 is a description of the complete instruction set. Note
that a buffer register interfaces the CN, and that the buffer can hold an
address and a word of data for a STOREM, or accept a word of data
from EM on a LOADEM, without interfering with, or requiring assistance
from, whatever instruction is in the processor. Hence,
instructions which access this buffer must be able to test whether
it is "busy", that is, dedicated to an uncompleted LOADEM or STOREM, and
whether or not it is "full". To a large extent, these tests on
"full" and "busy" replace the waiting for "go" in the baseline
system of the instruction set. For example, a STOREM, having told
this buffer to empty itself at the designated memory module, need
not wait for anything more to happen, but the next instruction may
start immediately.
The list has been simplified by using condensed notation. A
"(L,M)" following a mnemonic means that either L or M can be
appended to the mnemonic to create other instructions in which the
designated operand can come from memory, or is literal, instead of
register two. Likewise, instructions with almost identical
descriptions will be combined into a single description.
An "F" prefix designates a floating point operation using floating
point registers, "I" designates an integer operation using integer
registers, and "C" is the coordinator, using the integer registers
in the coordinator.
The symbol "&" designates concatenation. "Next" designates the
register next after the designated one. Names in quotes are
specific control bits. "->" designates that the data just
described is to be inserted into the location designated just
after.
Major changes from the Preliminary Study [2] are listed in the
following paragraphs.
Most synchronizations are put onto the CN buffer so that
individual instructions are not held up waiting. "I got here" is
set by one instruction, and then usually tested at some later time
to see if "go" reset it, although WAIT and LOOP still wait for
"go". LOADEM and STOREM with the new CN are completely free of
any synchronization requirements, thanks to the CN buffer.
The instructions by which the coordinator causes diagnostics to be
imposed on the processor are more complete in this list than
previously.
The data path directly from coordinator to processor through
fanout boards, of ref. 1 and 2, has been eliminated. Instead, the
coordinator has been given access to a CN port, which can then be
set to a "broadcast" condition where it connects to all processor
ports in parallel. The control path from coordinator to processors
remains.
Double-precision floating point has been included. Double-length
format is two words in single precision format, with an exponent
difference of 36, and with the second word not necessarily
normalized.
Several corrections, such as incrementing before testing in ITIX
and CTIX, also make these instructions differ from the previous
description.
C.3 MICROPROGRAMMABILITY
Burroughs, on its own funds, has been building an evaluation model
of a processor similar to the single FMP processor (see Appendix E
of ref. 1). This exercise shows that the preferred
implementation, even for a fixed instruction set, will be instruction
decoding by ROM or PROM. Hence, there will be room to modify
the instruction set until fairly late in the design cycle, as long
as the new instructions use the same basic hardware resources as
the defined instructions. Thus, for example, a Newton-Raphson
square root could be included as a microprogrammed instruction,
but the square root algorithm that uses a slight modification of
the divide algorithm would involve a one or two gate change in the
arithmetic chip and could not. Double precision instructions are
microprogrammed from single-precision hardware.
C.4 COORDINATOR OPERATIONS
In all test cases extracted from aero flow or weather codes, the
coordinator has nothing to do for long stretches of time, only an
occasional SYNC instruction to enforce the data precedence
conditions at the end of the DOALL.
On the other hand, the coordinator will have system functions to
perform, such as responding to I/O-complete interrupts at the end
of DBM-EM transfers. These two functions are interlaced at the
same instruction execution station; "all processors ready" is an
interrupt that is allowed in system-function code execution, and
masked off in user code, so that system functions can be executed
during the long waits in coordinator user code.
C.5 FORMATS
This instruction set is presented to demonstrate feasibility of
the FMP. Some of the assumptions underlying this instruction set
could conceivably be changed during the actual design of the FMP.
These assumptions include addressable registers, a desire to
sometimes use absolute addresses, and a data word size of 48 bits.
A data word size of 48 bits points to 48 bits and its submultiples
as preferred instruction sizes also. This instruction set assumes
24 and 48-bit instructions. Within 24 bits we get an opcode and 3
register addresses; or an opcode, two register addresses and a
7-or-8-bit count field; or a 4-bit opcode, a register address, and
a 16-bit literal. Within 48 bits we can get two address-sized
fields with or without index designations plus one register
address and an opcode; or one address-sized field and two or three
register addresses.
If, instead, one assumes 16, 32, and 48-bit formats, the register
instructions would largely be two-address, either Reg 1 op Reg 2 ->
Reg 1 or Reg 1 op Reg 2 -> Accumulator, in order to fit the common
instructions into 16 bits.
C.6 ADDRESSING
The address field (18 bits) consists of either "00" + 16-bit
absolute address, "01" + 16-bit literal, "10" + 4-bit register
identifier + 12 bits offset, or "11", undefined so far.
Absolute addressing is intended to be used only for system software
and for FORTRAN common. Simple variables and "descriptors"
have relative addresses with respect to the stack pointer, just
like in the B 6700, and 12 bits should be enough. "Descriptors" is
loosely used to refer to base addresses of named common areas and
base addresses of local arrays (or "IN ALL" arrays whose scope is
within the subprogram only).
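The three defined address-field formats can be illustrated with a small decoder. This Python sketch assumes the two tag bits are the most significant bits of the 18-bit field and that the register identifier precedes the offset, following the order in which the parts are listed; the function name and the returned tuples are hypothetical.

```python
def decode_address_field(field: int):
    """Decode an 18-bit address field into (kind, ...) per the listed formats."""
    if not 0 <= field < (1 << 18):
        raise ValueError("address field must fit in 18 bits")
    tag = field >> 16        # assumed: tag occupies the top two bits
    body = field & 0xFFFF    # remaining 16 bits
    if tag == 0b00:
        return ("absolute", body)
    if tag == 0b01:
        return ("literal", body)
    if tag == 0b10:
        # 4-bit register identifier followed by a 12-bit offset.
        return ("indexed", body >> 12, body & 0xFFF)
    raise ValueError('tag "11" is undefined so far')


example = decode_address_field(0b10_0011_000000000101)
```

The example decodes to register 3 with offset 5. A 12-bit offset reaches 4096 words from the base held in the register.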
In test cases, 12 bits was enough to access any element of any
local array. A base address of a local array, once fetched to a
local register, can be used for several accesses to that array.
When a single computed address is not enough then the restriction
to only one register that can be added to the offset creates some
additional integer arithmetic that has to be programmed. The test
cases show enough cases where the programming consists of a single
integer add, as to suggest that a fourth address format ought to
be "11" followed by two four-bit integer register addresses and an
8-bit address. The saving is one of code file size only, and not
directly in execution time, since the act of adding two integers
together takes one clock whether those integers are specified in a
separate IADD instruction or specified as indices associated with
an address field. Such double indexing would add one clock to the
beginning of any instruction in which it occurred. It is not
included in this description.
C.7 NUMBER OF INSTRUCTIONS
How reasonable is the expectation that the opcode field will be 8
bits? In this list are 174 processor instructions, 64
floating-point-only, 79 integer-only, and 31 other. Character or
string operators still are to be added. There are 100 coordinator
instructions, of which 29 are for system and diagnostic actions.
Some instructions occur very frequently, and it is worth
shortening the opcode to pack them into a smaller word. For
example, IMOVEL and IJUMP are candidates for being 24-bit
instructions. If they are, then their opcode is only 4 bits long,
and they each occupy 16 of the 256 slots in an 8-bit opcode space.
It is not possible to have a floating-vs.-integer bit in the
opcode, and a half-word vs. full-word bit too, leaving 64 instruc-
tions in each category.
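The opcode-space arithmetic in this section can be tallied in a few lines of Python. The counts are those quoted above; reading "not possible" as a consequence of the 79 integer-only instructions exceeding a 64-slot category is an interpretation, not something the text states outright.

```python
OPCODE_BITS = 8
total_slots = 2 ** OPCODE_BITS                       # 256 slots in all

# A 4-bit opcode fixes only the top 4 bits, so it spans 2**(8-4) slots.
slots_per_short_opcode = 2 ** (OPCODE_BITS - 4)
remaining = total_slots - 2 * slots_per_short_opcode  # after IMOVEL and IJUMP

# Processor instruction counts quoted in the text.
counts = {"floating-point-only": 64, "integer-only": 79, "other": 31}

# Reserving two opcode bits (float/integer, half/full word) leaves 6 bits.
per_category = 2 ** (OPCODE_BITS - 2)
overflowing = [name for name, n in counts.items() if n > per_category]
```

The tally gives 16 slots per 4-bit opcode, 224 slots left after two such opcodes, and one category (integer-only) that would not fit in 64 slots.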
C.8 INSTRUCTION EXECUTION TIMING
Timings are given in Table C.2. For the processor instructions
there are four separate functional units involved. For each of the
first three units, an instruction either has a starting time and an
ending time in that unit or does not use that unit. The time of execution of
each instruction is dependent on its time of occupancy (if any) in
each of the first three independent execution units, namely:
integer unit, floating point unit, and memory controls. The
timing is described most easily with respect to the instruction
fetching process, which determines the starting time of each
successive instruction. The fourth functional unit, the CN buffer,
allows EM fetches and stores to transpire in parallel with other
processing. It executes independently, once started, and does not
affect the starting of the next instruction, but may affect the
starting of the next instruction to use the CN buffer.
Entries in the table have the following significance:
"No. of clock periods" is the number of clocks from when the
instruction normally issues to a functional unit, to the
termination of the instruction. The instruction will always have
been decoded from out of the staging register for at least one
clock prior to this.
"Unit busy" is of the form n-m, where n is the number of the
latest clock that the previous instruction is allowed to occupy this
unit, and m is the last clock that the current instruction
occupies this unit.
Some instructions stop the instruction fetching process for a
while, until the coordinator or CN buffer restarts it. The clock
times given for these instructions represent the time from first
decoding such an instruction in the staging register, until the
start of decoding of the next instruction, under the most
favorable circumstances. These are WAIT, STOP, HELP, and any
instruction using the CN buffer.
C.8.1 Instruction Fetch Timing
Timing of the instruction fetching mechanisms can be seen with
respect to Figure C.1. The next instruction is held in a
staging register. Decoded from the staging register are the
start times required for the functional units if this instruction
were to start at this clock, and the time it will occupy the
holding register. Also decoded are the CN buffer requirements. From
the integer, the floating point, and the memory control
functional units is decoded the ending time associated with the
currently executing instruction. From the CN buffer come the "I
got here", "busy" and "full" conditions. The "scoreboard"
compares all inputs. When all comparisons say the next
instruction will not interfere with current instructions, the
instruction is transferred from the staging register to the one or
more functional unit instruction registers. If delayed starts in
other functional units are part of this instruction, the
instruction is passed to the holding register to free the staging
register for the next instruction.
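The scoreboard comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FMP logic itself: it assumes each unit reports how many clocks remain before the current instruction releases it, and that the staged instruction may issue when no unit it needs is still occupied at the clock it would start using that unit. The CN buffer conditions are left out, and all names are hypothetical.

```python
def can_issue(start_offsets, busy_remaining):
    """Return True if the staged instruction may issue at this clock.

    start_offsets:  {unit: clocks after issue at which the staged
                     instruction first needs that unit}
    busy_remaining: {unit: clocks until the currently executing
                     instruction releases that unit (0 if idle)}
    """
    return all(busy_remaining.get(unit, 0) <= start
               for unit, start in start_offsets.items())


def issue_delay(start_offsets, busy_remaining):
    """Clocks the staged instruction must wait before it can issue."""
    waits = [busy_remaining.get(unit, 0) - start
             for unit, start in start_offsets.items()]
    return max([0] + waits)
```

For example, an instruction that needs the floating point unit one clock after issue, while that unit is occupied for four more clocks, would be held for three clocks under these assumptions.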
[Figure C.1 Instruction Fetching Mechanism. Block diagram: the staging
register is decoded into start times for the integer, floating point,
and memory units; the scoreboard compares these with the end times of
the current integer, floating point, and memory operations and gives
the "issue" command; a holding register (for delayed issue) feeds the
functional unit instruction registers; a trigger goes to PM.]
The program counter always points to the next word in memory after
the staging register contents. Thus, normally the PM will be
holding the next instruction word statically at its output lines.
Only when the staging register is unloaded in less than three
clocks (the PM cycle) or PM is accessing data will the next word
not appear.
A complexity is the existence of half-word and full-word
instructions. Second halves of instruction words carry the next
half-word instruction, so full-word instructions may only have
their first half present in the staging register. The first half
is sufficient to determine the timing. However, the second half
will contain any memory addresses, so when a fetch from memory is
involved, the second half must also be fetched before the memory
part of the operation can start.
Those instructions which contain a memory address (either for data
or as a branch address), or a literal, are full-word 48-bit
instructions. Others are 24 bits. FL, floating literal, is one
and a half words.
The arithmetic timings assume perfect rounding on single length
floating point operations, but that the excess precision makes
rounding unnecessary on double length operations.
Instructions labelled "branch" will cause all lookahead to hold up
until the direction that the branch takes is determined. Branches
defeat overlap. If the branch is taken, there will be an additional
five clocks, three for fetching the instruction and two for
filling the instruction lookahead mechanism, before the instruction
after the branch can start executing.
An alternative method of providing branching capability is to
separate the testing operation, which sets one or more result
bits, and the branch instructions, which test those bits. This
method has the advantage that one can define a scheme for having
lookahead fetch instructions along the branched-to path, rather
than in the fall-through direction. Since branches are usually
taken, some slight improvement in performance would accrue, in
addition to which the instruction set becomes somewhat simplified.
The instructions CLOADEMN and CSTOREMN are assumed to be
implemented as a microprogrammed sequence of successive single EM
accesses. These could be substantially speeded up if hardware
were added so that the EM module could recognize these commands as
different from single LOADEM and STOREM, and keep the CN path
locked up.
The CN clock frequency is one third (a submultiple) of the main
clock frequency. With the main clock at 40 ns, the CN clock is 120
ns. All passing of addresses and data through the CN will be
synchronized with this CN clock. Thus, the 6, 9, 12, or 15
clocks taken by instructions that pass data through the CN are
actually 2, 3, 4, or 5 CN clocks. Operations involving the CN
buffer only, such as loading its registers or testing its
flipflops, can be done on any processor clock, and are not locked
to any one of the three phases of the CN clock. For example, the
instruction FSTOREM takes three clocks to load address and data
into the CN buffer if it finds it free. These clocks do not
depend on CN clock phase. However, the minimum of 6 clocks that
the CN buffer is busy involves sending data to EM, and can be the
minimum of 6 only if the STOREM loads the data into the CN buffer
at the proper CN clock phase.
For an example of these timing rules applied, see Reference 2.
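The relation between processor clocks and CN clocks can be illustrated with a short sketch. The 40 ns and 120 ns periods and the 6-clock minimum are from the text; the assumption that CN clock edges fall on processor clocks divisible by three, and that a badly phased load simply waits for the next edge, is made here only for illustration. Function names are hypothetical.

```python
MAIN_CLOCK_NS = 40          # main (processor) clock period, from the text
CLOCKS_PER_CN_CLOCK = 3     # CN clock is one third of the main clock frequency


def cn_clock_ns():
    """CN clock period in nanoseconds."""
    return MAIN_CLOCK_NS * CLOCKS_PER_CN_CLOCK


def to_cn_clocks(processor_clocks):
    """Convert a CN transfer time from processor clocks to CN clocks."""
    if processor_clocks % CLOCKS_PER_CN_CLOCK:
        raise ValueError("CN transfers take a whole number of CN clocks")
    return processor_clocks // CLOCKS_PER_CN_CLOCK


def cn_busy_clocks(load_done_clock, min_busy=6):
    """Processor clocks the CN buffer stays busy after a store is loaded.

    Assumption: CN clock edges occur when the processor clock count is a
    multiple of three, so a load finishing off-phase waits 1 or 2 clocks.
    """
    phase_wait = (-load_done_clock) % CLOCKS_PER_CN_CLOCK
    return min_busy + phase_wait
```

Under these assumptions the minimum of 6 clocks is reached only when the load completes exactly on a CN clock edge; otherwise one or two processor clocks are added.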
C.8.2 Coordinator Timing
The coordinator has a similar set of independent units. There is
an arithmetic unit similar to the processor's integer unit. There
is a memory control unit. For accessing EM, there is a CN buffer
unit identical to those found in each processor. The coordinator
also has access to a port on the EM side of the CN, from which it
can broadcast data to all processors, and "harvest" data from all
processors. This second port is part of the arithmetic unit, for
timing, and the compiler will ensure that the CN is idle whenever
the instructions that use this port, mostly the instructions that
are included for diagnostics, are used. These are the
instructions from BDCST through READPM in Table C.2. Although
they use the CN, they do not use the CN buffer.
The diagnostic controller is not used during normal program
running. It is used only for diagnostics and system
initialization. Hence, diagnostic controller information is not
required to generate timing information about user programs.
C.8.3 Synchronization
Synchronization enters into the timing analysis in two ways.
First, the instructions that use the CN buffer may test to see
whether "I got here" is up, and may test whether the CN buffer is
"full", "busy" or neither. The actual tests required are listed
in the descriptions of the individual instructions. These
instructions then wait until the CN buffer takes on the
appropriate state before continuing. Some of these instructions
leave the CN buffer with an unexecuted command, such as STOREM,
which will be "busy" until the address and data have been
successfully emptied into an EM module, or LOADEM which will be
"busy" until data comes back from EM to make it "full". The
processor will be free to go on executing any instruction except
those which depend on the CN buffer having gotten to the new
state. Some CN buffer states require action on the part of the
coordinator. For example, only after all processors execute
EMFILL can the coordinator execute the HVST instructions. Only
after all processors execute EMREQ can the coordinator execute the
corresponding FETCHEM or BDCST. Only after the coordinator has
executed FETCHEM or BDCST can the processor execute the REM
instruction that accepts the broadcast data.
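The "busy", "full" or neither states of the CN buffer can be pictured as a small state machine. The sketch below is an illustration only: the class and method names are hypothetical, a processor stall is modelled by raising an exception, and reading a buffer that is not "full" is treated as an error, as in the FREM description in Table C.1.

```python
from enum import Enum


class CNState(Enum):
    IDLE = "neither"
    BUSY = "busy"
    FULL = "full"


class WouldWait(Exception):
    """Stands in for the processor waiting until the buffer changes state."""


class CNBuffer:
    def __init__(self):
        self.state = CNState.IDLE
        self.word = None

    def _require_not_busy(self):
        if self.state is CNState.BUSY:
            raise WouldWait("CN buffer is busy")

    def storem(self, address, data):
        """STOREM: load address and data, then stay busy until EM accepts."""
        self._require_not_busy()
        self.word = (address, data)
        self.state = CNState.BUSY

    def em_accepts_store(self):
        """EM has taken the address and data: neither busy nor full."""
        self.word = None
        self.state = CNState.IDLE

    def loadem(self, address):
        """LOADEM: send the address, then stay busy until data returns."""
        self._require_not_busy()
        self.word = address
        self.state = CNState.BUSY

    def em_returns_data(self, data):
        """Data has come back from EM: the buffer is now full."""
        self.word = data
        self.state = CNState.FULL

    def rem(self):
        """A -REM instruction: read the full buffer and empty it."""
        self._require_not_busy()
        if self.state is not CNState.FULL:
            raise RuntimeError("CN buffer not full: error interrupt")
        data, self.word = self.word, None
        self.state = CNState.IDLE
        return data
```

The processor is free to run other instructions between `loadem` and `rem`; only instructions that need the buffer in its new state have to wait.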
The second synchronization method involves single processor
instructions such as WAIT. The processor checks to see if "I got
here" is down from any previous case. If not, it waits for "go"
to come from the coordinator to reset "I got here". Then the
processor raises "I got here", and waits for "go" before fetching
the next instruction.
C.8.4 Exceptional Cases
Within the processor, all fault cases result in an interrupt to
system software that is resident in the processor, estimated at
less than 1K words. It is possible to handle some interrupts
without interrupting the CR. Floating-point out-of-range detection
does not cause interrupts, but results in setting the
floating-point variables to "infinity" or "infinitesimal". Any
integer overflow causes an interrupt, on the theory that most
integer operations are address calculations and overflow represents
a faulty address. Attempting to insert a number outside the
range ±(2^15 - 1) into a 16-bit integer register causes an integer
interrupt; likewise executing a FIXD (double-length integer) on a
number outside the range ±(2^31 - 1) results in an interrupt. Any
detection of error in the error-detection-correction logic results
in a processor interrupt. When the error is correctable, the
interrupt merely logs its occurrence and returns to user
processing within a few microseconds.
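The integer range rule can be stated compactly in code. The sketch below assumes the symmetric ranges ±(2^15 - 1) for a 16-bit register and ±(2^31 - 1) for a double-length (32-bit) integer; the function name is hypothetical.

```python
def fits_integer_register(value, bits=16):
    """True if value lies in the symmetric range +/-(2**(bits-1) - 1).

    A value outside this range would cause an integer interrupt.
    """
    limit = (1 << (bits - 1)) - 1
    return -limit <= value <= limit
```

Note that the range is symmetric, so the most negative two's-complement pattern is excluded under this reading.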
The processor enters interrupt mode whenever any bit of the inter-
rupt register, not disabled by the corresponding bit of the mask
register, is set. The "interrupt" mode flipflop is visible to the
coordinator, which can interrogate whether any processor is in
interrupt mode. One of the bits of the coordinator interrupt
register is the "all processors ready" signal, thereby allowing
the coordinator to perform system software functions during its
long waits in user program.
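The condition for entering interrupt mode is a simple bitwise test. In this sketch the polarity of the mask register is an assumption: a ONE in the mask is taken to disable the corresponding interrupt bit. The function name is hypothetical.

```python
def in_interrupt_mode(interrupt_register, mask_register):
    """True if any interrupt bit is set and not disabled by its mask bit.

    Assumption: a ONE in mask_register disables the matching interrupt bit.
    """
    return (interrupt_register & ~mask_register) != 0
```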
Note that there are two lines from the processor to the coordi-
nator that can be called "interrupt" lines. The processor HELP
instruction raises an "interrupt" line that sets the "processor
interrupting" bit in the coordinator's interrupt register. The
"processing interrupt" mode of each processor can be interrogated
by the PINT instruction of the coordinator. In one case the
intent is for the processor to interrupt the coordinator; in the
other, the processor has been interrupted.
C.9 INTERRUPTS
Both coordinator and processor have an interrupt register.
Processor interrupts are to processor-resident software. After logging
a recoverable error, processor software will return to user
processing within a few microseconds. For non-recoverable errors,
processor software issues an interrupt to the coordinator in order
to shut down the entire FMP. In the processor, the list of
interrupts is (with recoverable interrupts identified):
Single error corrected in processor memory (recoverable)
Double error detected in processor memory
Single error corrected in word received from CN buffer
(recoverable)
Double error detected in word received from CN buffer
Parity error in microprogram word
Memory bounds error
Uninitialized word fetched from EM
Unnormalized floating point operand detected
Integer overflow
Divide by zero integer
Divide by zero floating point
Error detected in logic operation of EU
Software generated interrupt (set by ICALLI) (recoverable)
Illegal Op Code
Floating point overflow and underflow are caught by changing the
word to "unrepresentable" (or loosely, "infinity") and
"infinitesimal". Divide by floating point zero also results in
"unrepresentable", so for some purposes this interrupt would be masked off
as redundant. There is a control bit which determines whether
exponent underflow results in infinitesimal or zero. The single
error corrections are serviced by a routine resident in the
processor which logs their occurrence. Return is to the user
program. Most other interrupts will result in program
termination. It is the design intent to save the memory address and the
corrected bit number for error corrections and the memory address
of double error detections.
In the coordinator, the interrupt register has the following bits:
EM module error EM module parity error data in (address)
Single error detected in coordinator memory
Double error detected in coordinator memory
Single error detected in word received from CN buffer
Double error detected in word received from CN buffer
Parity error in microprogram word
Processor interrupt (sent by processor)
All processors ready (interrupts system software to get user
program's instruction executed)
Memory bounds in coordinator memory
Illegal opcode
Memory bounds in EM
Support processor interrupt
DBM result descriptor ready
Diagnostic controller interrupt
Timeout, no instruction executed for the last X ms.
Interval timer count down to zero
Integer overflow
Divide by zero
Logic error detected in coordinator operation
DBM controller error detected
Software generated interrupt (set by CCALLI)
Unrecoverable interrupts enter interrupt processing at address 0.
Recoverable interrupts (single error corrections and ICALLI in the
processor, in the coordinator single error, CCALLI, interval
timer, support processor interrupt) enter interrupt processing at
a second, hard-wired address. In the coordinator, the "all
processors ready" interrupt has its own hard-wired address.
Processor interrupts interrupt to processor-resident software;
coordinator interrupts interrupt to coordinator-resident software.
C.10 SUBROUTINE ENTRY AND RETURN
This section describes how subroutine entry and return are done.
Environment
Two integer registers are permanently designated as
SB   The pointer to the "base" of the address space for the
     current subroutine
SL   The pointer to the limit of this address space
     (actually points to the first word beyond the
     allocated space).
"S" stands for "stack", since space is allocated as a stack.
Fig. C.2 shows this stack.
C.10.1 Subroutine Entry
Prior to the call, the variables temporarily held in registers
must be stored back in PM if there is any chance the called subrou-
tine will reference them. Registers that the called subroutine
will use must also be saved. The compiler simply stores every-
thing back to its "home" address in PM.
At the place pointed to by SL, the caller next writes any para-
meters passed by value (where this is allowed in our FORTRAN), and
the base addresses of any arrays being passed, and the descriptors
of any named common areas. There are P words in this area, where
P is known to the compiler.
Next, the CALL instruction is executed. It does the following:
1. The contents of SB, SL, and the program counter are
   concatenated and written into address P+SL.
2. Register SB is loaded with SL + P.
3. Register SL is loaded with the new value of SB plus a
   literal, the space allocation known to the compiler.
CALL therefore has two parameters, the number of parameters
passed, and amount of space allocated. In ANSI FORTRAN 77, both
of these would be literal fields in the instruction. For some of
the dynamic array sizes that are allowed in FMP FORTRAN, it will
be necessary to insert code to compute the size, and leave it in
an integer register. The absolute program address is computed
from the content of the branch address field and inserted into PCR
for fetching the next instruction.
C.10.2 Subroutine Return
RETURN executes as follows:
1. Fetch the word addressed by SB.
2. Unpack that word into SB, SL, and the program counter.
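The CALL and RETURN steps can be traced with a small simulation. This is a sketch under stated assumptions, not the FMP implementation: memory is a Python list, the link word is stored as a tuple rather than a packed 48-bit word, and the program counter is assumed to already point at the instruction after the CALL when it is saved. The class and method names are hypothetical.

```python
class Machine:
    """Toy model of the SB/SL stack discipline described above."""

    def __init__(self, size=64, sb=0, sl=8):
        self.mem = [None] * size
        self.sb = sb      # base of the current subroutine's address space
        self.sl = sl      # first word beyond the allocated space
        self.pc = 0       # program counter

    def call(self, n_params, alloc, target):
        """CALL with P = n_params words already written at SL..SL+P-1."""
        link = self.sl + n_params
        # 1. SB, SL, and program counter are saved at address P + SL.
        self.mem[link] = (self.sb, self.sl, self.pc)
        # 2. SB is loaded with SL + P.
        self.sb = link
        # 3. SL is loaded with the new SB plus the space allocation.
        self.sl = self.sb + alloc
        self.pc = target

    def ret(self):
        """RETURN: fetch the word addressed by SB and unpack it."""
        self.sb, self.sl, self.pc = self.mem[self.sb]
```

In this model a parameter written at the old SL is found at offset -1 from the new SB, and working space starts at positive offsets from SB, consistent with the addressing described for the called subroutine.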
[Figure C.2 Subroutine Stack. Labels recoverable from the figure, for
MAIN calling SUBA calling SUBB: base addresses of any named commons in
MAIN; MAIN local variables (MAIN base address to MAIN LIM); parameters
and named common descriptors; link word (RETURN PCR, MAIN LIM, MAIN
BASE ADDR) at SUBA base address; SUBA local variables up to SUBA
limit; SUBB parameters, descriptors, and any named commons; link word
(RETURN PCR, SUBA LIM, SUBA BASE ADDR); SUBB local variables up to
SUBB limit.]
[Figure C.3 Stack Allocation in the Data Area. Labels recoverable from
the figure: blank common; stack of named commons; local variables and
parameters, subroutine return stack; data area.]
[Figure C.4 Organization of Named Common. Labels recoverable from the
figure: COMMON /A/, COMMON /B/, and COMMON /C/, each headed by a word
holding size, name, and in-use count; the descriptors for /A/, /B/,
and /C/ point to these words; second stack limit.]
If the subroutine is a function, the results of that function will
be left in a single-length or double-length integer or
floating-point register as appropriate. The register is
determined by convention, and is the same for all functions of the
same type.
C.10.3 Within the Subroutine
Working space is addressed by positive offsets relative to SB. We
have 12 bits of address that may be added to an integer register
as part of the normal addressing machinery. When 12 bits is not
enough, the compiler will have to use integer instructions to
build the address.
Parameters and base addresses of named common areas are accessed
by negative offsets from SB, as implied in the description of
entering a subroutine in C.10.1.
C.10.4 Addressing
With the above structure, absolute addresses may be used for
simple variables in the main program and for blank COMMON.
Varying degrees of indirection are implied, the most complicated
case being an element of an array in a named common in a
subroutine, where an offset from SB is used to find the base
address of the named common, an offset from that base is the base
of the array, and the element is offset from the array beginning.
(A smart compiler may combine the last two into a single offset
and will fetch the base address of a named common to an integer
register upon its first use.)
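The most indirect case, an array element in a named common referenced from a subroutine, can be written out as a single address computation. This is an illustrative sketch; the function and parameter names are hypothetical, and memory is modelled as a mapping from address to word.

```python
def element_address(mem, sb, descriptor_offset, array_offset, element_offset):
    """Address of an array element held in a named common.

    mem[sb - descriptor_offset] holds the base address of the named
    common (parameters and descriptors sit at negative offsets from SB).
    """
    common_base = mem[sb - descriptor_offset]
    return common_base + array_offset + element_offset
```

A compiler that folds `array_offset + element_offset` into one constant and keeps `common_base` in an integer register performs the same computation with a single add.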
C.10.5 Named Common Mechanism
A second stack of space is allocated to all the named commons. If
the first stack grows by increasing addresses, the second stack
may grow by decreasing addresses. For example, see Figure C.3.
At address zero of each named common is a count of how many
subroutines are currently active which name that common. Each CALL
goes through the descriptors in the parameter area and increases
each count. Each RETURN goes through the descriptors in the parameter
area and decreases each count. A named common used in all
subroutines therefore lasts the entire run, as does blank common.
The words at address zero also contain the size of the named
common, so that they form a relatively-addressed linked list to
each other. See Fig. C.4. Whenever the count goes to zero at
the last named common, the stack limit of the second stack is
decreased to the first non-zero count.
In ALGOL, where addressing environments are nested in lexic
levels, the above mechanism always releases space upon the exit
from the last lexic level that needs that space. In FORTRAN, if
we adopt the above mechanism, it is possible to undefine a block
of space inside this second stack, but it won't be released until
the spaces "above" it in the second stack are themselves released.
A named common disappears whenever no subroutine owning it is
active. A named common descriptor will either be found in the
calling subroutine, upon subroutine entry, or must be created.
Thus, the presence of the appropriately named descriptor in the
calling subroutine causes the descriptor to be copied; while the
absence of an appropriately named descriptor causes new space to
be allocated, and a new descriptor to be created.
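The in-use counting and the release rule for the second stack can be sketched as below. This is an illustration only: the stack is modelled as a Python list rather than as descending addresses, a named common with a zero count is treated as gone even while its space is still trapped in the stack, and all names are hypothetical.

```python
class NamedCommonStack:
    """Toy model of the second stack holding named commons."""

    def __init__(self):
        # Oldest allocation first; the last entry sits at the stack limit.
        self.blocks = []

    def _active(self, name):
        for block in self.blocks:
            if block["name"] == name and block["in_use"] > 0:
                return block
        return None

    def enter(self, name, size):
        """A called subroutine names this common: share it or allocate it."""
        block = self._active(name)
        if block is None:
            block = {"name": name, "size": size, "in_use": 0}
            self.blocks.append(block)
        block["in_use"] += 1
        return block

    def leave(self, name):
        """A subroutine naming this common returns."""
        block = self._active(name)
        if block is not None:
            block["in_use"] -= 1
        self._release()

    def _release(self):
        # Space is given back only from the limit end of the stack.
        while self.blocks and self.blocks[-1]["in_use"] == 0:
            self.blocks.pop()

    def words_allocated(self):
        return sum(block["size"] for block in self.blocks)
```

A block whose count reaches zero while later blocks are still in use keeps its space until those later blocks are released, which is the FORTRAN behaviour contrasted with ALGOL above.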
A provision for statically allocated common areas, that survive
for the life of the job, can easily be made if desired. They have
been omitted from this description because such statically allo-
cated variables occupy needed space during times that they are
inactive, and because such static allocation, outside of blank
common, is not needed for compliance with FORTRAN 77. In the 3D
implicit code, as explained in Appendix A, the maximum mesh size
would be smaller if all variables were statically allocated.
C.10.6 Arithmetic Details
The design intent is to provide perfect rounding. A floating
point number is a discretized representation of an assumed under-
lying real number. When two floating point numbers are combined,
the result is to be the closest representation possible of the
real number result from combining the two underlying real numbers.
Thus, whenever the guard bits are less than one half a least
significant bit, the surviving part of the mantissa shall be left
alone. When they are more than one half a least significant bit,
one is to add 1 to the surviving part of the mantissa. In the FMP
processor, a full double length accumulator cannot be justified.
Therefore, when the eight guard bits are exactly one ONE followed
by ZEROs, they may represent one ONE followed by seven ZEROs,
followed by additional unknown bits, hence we round by adding 1 in
the least significant place whenever the most significant guard
bit is ONE. Alignment of addends is done in one clock, with a
barrel, hence the implementation of a "sticky" bit represents
substantial hardware investment.
Guard bits and rounding are used to preserve precision in
single-length arithmetic (36 bits of precision), but not in double
length (72 bits of precision), giving roughly 11 decimal digits and
21 decimal digits of precision respectively.
Rounding occurs after normalization. Since we have six-or-eight
guard bits, rounding is a no-op when normalization requires a left
shift of more than six-or-eight places. The guard bits, shifted
into the result by the normalization, protect precision
effectively.
Rounding after addition can be simplified by observing that
whenever mantissa overflow occurs after the rounding, the resulting
mantissa must be .100000... . However, we have to add one to the
exponent. Since the exponent adder is not otherwise busy during
rounding, we have exponent in the result register and exponent + 1
being presented at the output of the exponent adder, so that, if
rounding overflows, exponent + 1 is loaded into the result exponent
field, while .1000000000 is loaded into the mantissa, all without
requiring any additional clock.
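The rounding rule for addition, including the overflow case, can be sketched in a few lines. This is an illustration under stated assumptions: the mantissa is held as an unsigned integer with the guard bits in the low-order positions, the 36-bit width is taken from the single-length precision quoted in this section, and the function name is hypothetical.

```python
GUARD_BITS = 8    # eight guard bits, as in the text
MANT_BITS = 36    # single-length precision quoted in this section


def round_after_add(wide_mantissa, exponent,
                    mant_bits=MANT_BITS, guard_bits=GUARD_BITS):
    """Round a normalized mantissa carrying guard bits in its low end.

    Adds 1 in the least significant kept place whenever the most
    significant guard bit is ONE. If that overflows the mantissa, the
    result is .1000...0 with the exponent increased by one.
    """
    kept = wide_mantissa >> guard_bits
    top_guard = (wide_mantissa >> (guard_bits - 1)) & 1
    if top_guard:
        kept += 1
    if kept == 1 << mant_bits:          # mantissa overflow after rounding
        kept = 1 << (mant_bits - 1)     # .1000...0
        exponent += 1
    return kept, exponent
```

Because no sticky bit is kept, a guard field of exactly one ONE followed by ZEROs is rounded up, as the text explains.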
A zero result that may get rounded away from zero is a special
case. The sign and exponent of an apparently zero result must be
saved until after rounding, to accommodate the case that the
result will be rounded away from zero. All zeroes have positive
sign and the smallest allowable exponent.
In multiply, normalization and rounding are done together in one
clock. A product never overflows, and normalization is by either
no or one place. Add 1/4 of a LSB to the product on the last cycle
of multiply using the carry input to the second guard bit. Thus,
if normalization by one place is required, the product is already
rounded. If normalization is not required, add another 1/4 of a
LSB. Only 2 guard bits are needed at the end of multiply (al-
though more are needed to keep the partial products honest during
the formation of the final product). Thus normalization and
rounding take one clock altogether. At that last clock: norma-
lize the already-rounded product if the leading bit is ZERO;
select the output of the adder (Result + 1/4 (LSB)) if the leading
bit of mantissa is ONE.
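The quarter-LSB trick for multiply can be checked against ordinary round-to-nearest (halves rounded up). In this sketch mantissas are n-bit unsigned integers with the leading bit set, only two guard bits are kept at the final step, and the leading-bit test is made after the first quarter-LSB has been added; that last point is an assumption of the sketch. The function names are hypothetical.

```python
def multiply_round(a, b, n):
    """Product of two normalized n-bit mantissas, rounded to n bits.

    Returns (mantissa, exponent_adjust). Requires n >= 3.
    """
    assert a >> (n - 1) == 1 and b >> (n - 1) == 1
    q = (a * b) >> (n - 2)       # n result bits plus 2 guard bits
    r = q + 1                    # add 1/4 LSB via the second guard bit
    if (r >> (n + 1)) & 1:       # leading bit ONE: no normalization
        return (r + 1) >> 2, 0   # add another 1/4 LSB, drop guard bits
    return r >> 1, -1            # shift left one place: already rounded


def reference_round(a, b, n):
    """Exact round-half-up of the full double-length product."""
    p = a * b
    if p >> (2 * n - 1):
        shift, adjust = n, 0
    else:
        shift, adjust = n - 1, -1
    m = (p + (1 << (shift - 1))) >> shift
    if m == 1 << n:              # rounding carried out of the mantissa
        m, adjust = 1 << (n - 1), adjust + 1
    return m, adjust
```

For small widths the two functions can be compared exhaustively, which supports the statement that only two guard bits are needed at the end of multiply.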
C.10.7 Other Instructions
Some operations are implemented as simple by-products of the
instructions in Table C.1. By-product instructions include:
Convert from single-precision to double-precision. Given by
FADDXL, literal = zero.
Convert from double-precision to single-precision. Address
the first half only of a double-precision word.
Divide (multiply) integer by power of two. ISHN(L).
Extract fraction-part from floating-point word. FMOVEXL,
literal = zero. Useful in mathematical functions.
Half-word and full-word No-ops.
TABLE C.1
Processor Instructions

Floating

FADD(M,L),        Reg 1 plus (or minus) operand → Reg 3
FSUB(M,L)
FMUL(M,L)         Reg 1 times operand → Reg 3
FDIV(M,L)         Reg 1 divided by operand → Reg 3
FDVR(M,L)         Operand divided by Reg 1 → Reg 3
FMAD(M,L),        Reg 1 times operand is added to (subtracted from)
FSUB(M,L)         Reg 3 → Reg 3
FSSQ(M)           Reg 1 squared plus operand squared → Reg 3
FADEX(L)          Add operand (if literal, may be limited to 8-bit
                  literal in count field) to exponent field in fl. pt.
                  reg. If operand is in register, it will be an
                  integer register.
FMOVEX(L)         Transfer operand (from int. reg., or literal) to
                  exponent field in fl. pt. reg.
                  (In both the above instructions, the operand is in
                  integer format and will be converted to floating
                  point exponent format during the course of the
                  instruction.)
FABS(M)           |operand| → Reg
FNEG(M)           Minus the operand → Reg
FADDX(M,L),       Reg 1 plus (minus) operand → Reg 3 & next
FSUBX(M,L)        (double-length)
FMULS             Double length product of Reg 1 and Reg 2 → Reg 3
                  & next
FADDD, FSUBD      Double length sum (difference), Reg 1 & next (op)
                  Reg 2 & next → Reg 3 & next
FMULD             Double length product of two double-length operands.
                  Reg 1 & next * Reg 2 & next → Reg 3 & next
FL                The 48 bits following this opcode → Reg 3
FMOVE(M,L)        Operand → Reg 3
FPAKM             Most sig. 24 bits of Reg 1 & most sig. 24 bits of
                  Reg 2 → memory
FUPKM             From memory, the most sig. 24 bits of memory word
                  & 24 zeroes → Reg 1, the least sig. 24 bits of
                  memory word & 24 zeroes → Reg 2.
[illegible]       Reg → memory.
[illegible]       Convert operand in fl. pt. Reg. to nearest rounded
                  integer value → int. Reg.
FIXD              Convert operand in fl. pt. reg. to nearest rounded
                  integer value → int. Reg & int. Reg + 1
FIXF              Convert operand in fl. pt. Reg 1 to integer whose
                  absolute value is the largest possible but not
                  larger than the absolute value of the original
                  operand. Result → int. Reg 2. (Floor)
FIXC              Convert operand in fl. pt. Reg 1 to integer whose
                  absolute value is the smallest possible not smaller
                  than the absolute value of the original operand.
                  Result → int. Reg 2. (Ceiling)
FINFLZ            If Reg 1 contains "infinitesimal", zero → Reg 1
FIXEX             Convert exponent in fl. pt. reg. to integer format
                  → int. Reg.
FMT               Convert content of fl. pt. reg. to floating point
                  format used by the B-7800. (Will be microprogrammed,
                  and will use logic in the integer unit.) If
                  "unrepresentable" or exponent out of range,
                  interrupt.
FMTI              Content of fl. pt. register is assumed to be in
                  B-7800 floating point format, and is converted to
                  internal FMP floating point format.
FLOAT             Convert integer in int. Reg. to fl. pt. format
                  → fl. pt. Reg.
FLT(L,M),         If Reg 1 tests LT (or LE, GT, GE) operand, then
FLE(M,L),         GOTO branchaddress.
FGT(L,M),
FGE(L,M)
FEQL              If 1st 16 bits of Reg equal 16-bit literal, GOTO
                  branchaddress. This yields tests for zero,
                  "uninitialized", "infinitesimal" and
                  "unrepresentable/infinity", since these are all
                  encoded in the exponent field. No floating-point
                  word with zero exponent is allowed except zero
                  itself. (Tests for equal in floating point have
                  otherwise been eliminated as useless and
                  misleading.)
FLTD              Double-precision compare. If Reg 1 & next is less
                  than Reg 2 & next, GOTO branchaddress. (Reverse
                  registers for .GE.)
FGTD              If Reg 1 & next is greater than or equal to Reg 2 &
                  next, GOTO branchaddress. (Reverse register
                  addresses for .LE..)
SETFL             Set infinitesimal control bit. Exponent underflow
                  thereafter results in "infinitesimal".
SETZ              Reset infinitesimal control bit. Exponent underflow
                  thereafter results in zero.
IADD(M,L),        Reg 1 plus (minus) operand → Reg 3
ISUB(M,L)
IADD1, ISUB1      Reg 1 plus (minus) 1 → Reg 1
IMUL(M,L)         Reg 1 times operand → Reg 3
IDIV(M,L)         Reg 1 divided by operand → Reg 3
IMOD(M,L)         Reg 1 modulo operand → Reg 3
IMOD521           Reg 1 & next modulo 521 → Reg 3. (This is a special,
                  fast instruction, as it is needed to determine EM
                  module number from EM address. Estimated, 4 clocks.)
                  (Note the absence [illegible]. A Div 512 will be
                  built into the address path to CN buffer, taking no
                  time. The 1.8% holes left this way in memory can be
                  addressed by a different set of address
                  computations; they will be in a logically disjoint
                  address space.)
IADDX(M,L),       Reg 1 & next plus (minus) operand → Reg 3 & next
ISUBX(M,L)        (Double-length and single-length operands combined
                  into a double-length result)
IMULX(M,L)        Reg 1 & next times operand → Reg 3 & next
IDIVX(M,L)        Reg 1 & next divided by operand → Reg 3 & next
IMODX(M,L)        Reg 1 & next modulo operand → Reg 3
IADDD, ISUBD      Reg 1 & next plus (minus) Reg 2 & next → Reg 3 &
                  next
ISH(C,S,N)(L)     Shift Reg 1 end-around (or end-off, or numeric with
                  sign-bit fill if right or zero fill if left) by the
                  distance shown by the operand (positive is shift
                  left, to coincide with the requirements of numeric
                  shifts)
ISH(C,S,N)D(L)    Shift, as above, except double-length. Reg 1 & next.
IOR(M,L),         Reg 1 OR (or AND, implies, exclusive OR) operand
IAND(M,L),        → Reg 3
IIMP(M,L),
IXOR(M,L)
INOT(M,L)         NOT operand → Reg 3
IDL               Literal (32 bits) → Reg & next
IADDL             Reg & next plus literal (32 bits) → Reg & next
IMOVE(M,L)        Operand → Reg 2
IDMOVE(M,L)       Operand → Reg 2 & next (if operand is register, it
                  is a double-length register)
IPNO              Processor number (wired into backplane) minus
                  "sparebit" → Reg. Sparebit = 1 if processor above
                  the spare location, = 0 if processor below. Leading
                  two bits are cabinet number, and are not involved in
                  the subtraction, since each cabinet has one spare.
IPAK3M            Reg 1 & Reg 2 & Reg 3 → memory
IUPK3M            Memory → Reg 1 & Reg 2 & Reg 3
IPAK3F            Reg 1 & Reg 2 & Reg 3 → fl. pt. reg. (Because of
                  instruction format limitations, not all three int.
                  Reg. will be explicitly addressed; one or two of
                  them will be "next" int. Reg.)
IUPK3F            Fl. pt. Reg. → Reg 1 & Reg 2 & Reg 3
ISTORE            32 zeroes & Reg → memory
IDSTORE           16 zeroes & Reg & next → memory
ILT(M,L),         If Reg 1 tests LT (or LE, GT, GE, EQ, NE) operand,
ILE(M,L),         then GOTO branchaddress
IGT(M,L),
IGE(M,L),
IEQ(M,L),
INE(M,L)
IDLT, IDGT,       If Reg 1 & next tests LT (or GT, EQ, or NE) to
IDEQ, IDNE        Reg 2 & next, then GOTO branchaddress. (Reversal of
                  registers provides the relations .GE. and .LE..)
IBIT(L)           If any bit of Reg ANDed with operand is ONE, GOTO
                  branchaddress

CN Buffer

FSTOREM           Wait for CN buffer to become NOT "busy". Send int.
                  Reg 1 (EM module number) and int. Reg 2 & next
                  (EM address) to CN buffer address portion, send
                  fl. pt. Reg 3 to CN buffer data portion. Mark CN
                  buffer "busy". (Following this instruction, CN
                  buffer will stay "busy" until an acknowledge is
                  received from the EM module, and the buffer contents
                  transmitted. Buffer will then be NOT "busy" and NOT
                  "full". The processor instruction execution does not
                  wait for any of this to happen.)
ISTOREM           Same as FSTOREM except substitute int. Reg 3 for
                  fl. pt. Reg 3.
IDSTOREM          Same as FSTOREM except substitute int. Reg 3 & next
                  for fl. pt. Reg 3.
I3STOREM          Same as FSTOREM except substitute int. Reg 3 & int.
                  Reg 4 & int. Reg 5 for fl. pt. Reg 3. Format
                  limitations will probably force the use of implicit
                  addresses for Reg 4 and Reg 5. They are likely to be
                  the next two after Reg 3.
MSTOREM           Same as FSTOREM except substitute memory for fl. pt.
                  Reg 3.
                  (Note the asymmetry between STOREM and LOADEM. In
                  LOADEM, the selection of destination is separated
                  from the EM address operation, in order to allow the
                  compiler to optimize the sequencing of instructions.
                  In STOREM, the instructions are combined in order to
                  save code space.)
FRREQ (Formerly "FETCHEM" and "BDCST", but with the new CN these
initiating actions are the same for both, i.e., just one
instruction.) Wait for "I got here" to be reset, if up.
If CN buffer is "busy", wait for CN buffer to become NOT
"busy". Raise "I got here". (Later, data will arrive in
the CN buffer, which will then be marked "full", and the
data can be read by any of the -REM instructions. Depending
on whether the coordinator executed a FETCHEM or a
BDCST instruction, that data will have arrived from EM or
from the coordinator itself.)
EMFILL (Formerly "HVST" and "SHIFCN", but with the new CN these
initiating actions are the same for both, just one
instruction.) If CN buffer "busy", wait for NOT "busy".
If "I got here" is up, wait for "go" to reset it. Raise
"I got here", load CN buffer (data part) from fl. pt. Reg,
and set CN buffer to "busy". (Following this instruction,
the coordinator will set the CN to a "broadcast" condition
if HVST, or to a "wraparound" condition if SHIFCN, and
move the data from the CN buffer. If SHIFCN is in the
coordinator instruction stream, then the compiler will have
inserted some form of -REM instruction later on in the
processor instruction stream to read the now "full" CN
buffer. Other sources of data are expected to be used so
seldom that instructions to HVST or swap data to and from
integer registers and memory are judged to be a waste of
decoding complexity.)
LOADEM Send Reg 1 (EM module no.) and Reg 2 & next (EM
address) to CN buffer, after waiting for CN buffer to
become NOT "busy". CN buffer will now become "busy"
until data arrives from EM, whereupon CN buffer becomes
"full". Fetch next instruction without waiting.
LOCKEM Send Reg 1 (EM module no.) and Reg 2 & next (EM
address) to CN buffer, after waiting for CN buffer to
become NOT "busy". CN buffer now becomes "busy" until
data arrives from EM, whereupon CN buffer becomes "full".
Processor does not wait in this instruction beyond the
loading of the CN buffer. EM module will set the least
significant bit of the word in memory to ONE after
transmitting the previous contents to the CN buffer.
(Used for interprocessor cooperation via EM independently
of the coordinator.)
FREM Wait for CN buffer NOT "busy". If now NOT "full", error
interrupt. CN buffer → fl. pt. Reg. Mark buffer NOT
"full" and NOT "busy".
IREM Same as FREM except CN buffer → Int. Reg.
IDREM Same as FREM except CN buffer → Int. Reg & next.
I3REM Same as FREM except CN buffer → Int. Reg & next & next.
MREM Same as FREM except CN buffer → memory.
ITIX First: Reg 1 + Reg 3 → Reg 1. Then, test Reg 1
against Reg 2, test for greater if Reg 3 is positive,
for less if Reg 3 is negative. If test succeeds, GOTO
branchaddress.
ITIXI Same, except an implied literal value of +1 substitutes for
Reg 3.
ITIXL Same, except an actual literal substitutes for Reg 3.
IJUMP GOTO branchaddress.
ICALL Subroutine entry. Push subroutine stack. Parameters and
new working area are relative to the new stack address
pointer.
IRETURN Subroutine return. Pop subroutine stack.
PUSH Push subroutine stack, do not change PCR (diagnostic use).
POP Pop subroutine stack, do not change PCR (diagnostic use).
TOS(L) Set the stack address pointer and the word it points to
to new values found in Reg 1, Reg 2, and the operand. (Stack
mechanism involves not only a stack address pointer, but
also return information and address bound(s) in word 0
relative to that pointer.)
WAIT Wait for reset of "I got here" (if it is up). Raise
"I got here". Wait for "go" before fetching next
instruction.
STOP Wait for reset of "I got here". Reset "enable". Raise
"I got here". Resetting of "enable" disables all further
instruction fetching.
HELP Same as STOP plus raise "interrupt" line to coordinator.
IINT(L) Interrupt register AND operand → Reg 3. Interrupt register
AND NOT operand → interrupt register. Operand is Reg 2
or literal.
ISMASK(L) Interrupt mask register OR operand → interrupt mask
register.
IRMASK(L) Interrupt mask register AND operand → interrupt mask
register.
ICALLI Enter interrupt mode.
IRETI Return from interrupt.
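The ITIX test-and-increment branch in the table above can be sketched in a few lines of Python. This is only an illustrative model: the function names are hypothetical, a zero step is treated like a positive one (the table does not say), and the branch address is assumed to be the loop exit.

```python
def itix(reg1, reg2, reg3):
    """Model one ITIX step: returns (updated Reg 1, branch taken?)."""
    reg1 = reg1 + reg3            # First: Reg 1 + Reg 3 -> Reg 1
    if reg3 >= 0:                 # assumption: zero is treated as a positive step
        taken = reg1 > reg2       # test for greater
    else:
        taken = reg1 < reg2       # test for less
    return reg1, taken


def loop_indices(start, limit, step):
    """Index values seen by a loop body that exits when ITIX branches."""
    seen = []
    index = start
    while True:
        seen.append(index)        # the loop body would execute here
        index, exit_loop = itix(index, limit, step)
        if exit_loop:             # GOTO branchaddress (assumed to be the loop exit)
            return seen


print(loop_indices(0, 3, 1))
print(loop_indices(5, 1, -2))
```

Under these assumptions the limit in Reg 2 is inclusive for both ascending and descending loops, which is why the sign of Reg 3 selects the direction of the comparison.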
RESET
HALT
FILLM
FILLME
FILLR
READR
READM
Table C.1, part 2
Processor operations induced by commands issued by the
coordinator.
Reset "enable" immediately. Do not wait for current
instruction to finish. Reset "busy" in CN buffer.
Reset "I got here".
Reset "enable" only. Allow current instruction to complete.
Load word in CN buffer into processor memory. Increment
memory address by 1. (MAR has previously been loaded.)
Same but conditional on "enable".
Load register from CN buffer. Register address will
follow this code on the command lines.
Same as FILLR except conditional on "enable".
Address processor. Reset "enable". Check contents of
CN buffer against processor number. If matches, set
"enable".
Transmit contents of register to CN buffer. Register
address will follow this code on the command lines.
Read word from memory and transmit to CN buffer. Increment
memory address by 1. (Register addresses will include
registers not addressable by the address fields in the
processor instruction set, such as PCR and memory address
register.)
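The "busy" and "full" flags of the CN buffer, as used by FSTOREM, LOADEM, and FREM in part 1 of Table C.1 and by RESET here, can be followed with a small state model. The sketch below is an illustration only: class and method names are hypothetical, and a processor stall ("wait for NOT busy") is represented by raising an exception rather than by blocking.

```python
class MustWait(Exception):
    """Stands in for the processor stalling until the buffer is NOT busy."""


class ErrorInterrupt(Exception):
    """Raised when a -REM style read finds the buffer NOT full."""


class CNBuffer:
    def __init__(self):
        self.busy = False
        self.full = False
        self.address = None
        self.data = None

    def _require_not_busy(self):
        if self.busy:
            raise MustWait("CN buffer is busy")

    def fstorem(self, module, address, value):
        self._require_not_busy()
        self.address, self.data = (module, address), value
        self.busy = True                  # stays busy until the EM acknowledges

    def em_acknowledges_store(self):
        self.busy = False                 # buffer is then NOT busy and NOT full
        self.full = False

    def loadem(self, module, address):
        self._require_not_busy()
        self.address = (module, address)
        self.busy = True                  # busy until data arrives from EM

    def em_data_arrives(self, value):
        self.data = value
        self.busy = False
        self.full = True                  # data is now readable by FREM

    def frem(self):
        self._require_not_busy()
        if not self.full:
            raise ErrorInterrupt("CN buffer is not full")
        self.full = False                 # mark NOT full and NOT busy
        self.busy = False
        return self.data

    def coordinator_reset(self):
        self.busy = False                 # RESET clears "busy" in the CN buffer


buf = CNBuffer()
buf.loadem(module=3, address=100)
buf.em_data_arrives(2.5)
print(buf.frem())
```

Because neither FSTOREM nor LOADEM waits for the EM to respond, the only points at which this model can stall are the next use of the buffer and the -REM read.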
TABLE C.1, part 3
Coordinator Instruction Set
Arithmetic
CADD(N,L), CSUB(N,L) Reg 1 plus (or minus) operand → Reg 3. (Note the
use of "N" to designate coordinator memory.)
CADDI Reg 1 plus 1 → Reg 1
CSUBI Reg 1 minus 1 → Reg 1
CMUL(N,L) Reg 1 times operand → Reg 3
CDIV(N,L) Reg 1 DIV operand → Reg 3
CMOD(N,L) Reg 1 modulo operand → Reg 3
CMOD521 Reg 1 modulo 521 → Reg 3. (Substantially faster than
CMODL with literal = 521.)
CSH(C,S,N)(L) Shift end-around (or end-off, or numeric) the operand in
Reg 1 by the distance shown in operand (Reg 2 or
literal).
CAND(N,L), COR(N,L), CIMP(N,L), CXOR(N,L) Reg 1 AND (OR, implies,
exclusive OR) operand → Reg 3
CNOT(N,L) NOT operand → Reg
CMOVE(N,L) Operand → Reg
CDL 32-bit literal → Reg
CADL Reg 1 plus 32-bit literal → Reg 1
CSTORE Reg → memory
CGT(N,L), CGE(N,L), CLT(N,L), CLE(N,L), CEQ(N,L), CNE(N,L) Test Reg 1
for GT (or GE, LT, LE, EQ, NE) against operand; if test is true,
GOTO branchaddress.
CBIT(N,L) If any bit of Reg AND operand is ONE, GOTO branchaddress.
Other Branch Controls
CTIX First: Reg1 + Reg3 → Reg3. Then, test Reg1
against Reg2, test for greater if Reg3 is positive,
for less if Reg3 is negative. If test succeeds, GOTO
branchaddress.
CTIXI Same, except an implied literal value of +1 substitutes
for Reg3.
CTIXL Same, except an actual literal substitutes for Reg3.
CJUMP GOTO branchaddress.
CCALL Call subroutine, push subroutine stack.
CRETURN Return from subroutine, pop subroutine stack.
CPUSH Same push-stack action as CCALL, but do not change program
counter.
CPOP Same pop-stack action as CRETURN, but do not change program
counter.
CTOS(L) Change stack pointer by loading it with operand.
CRETI Return from interrupt.
CCALLI Enter interrupt mode.
Other
CLOADEM Fetch to Reg1 from EM. EM address is in Reg2, EM
module no. is in Reg3. (Separation of EM address and EM
module no. permits accessing of both address spaces within
the EM. Note that the "EM address" will be stripped of
its last 9 bits before being transmitted to the EM as an
"address within module".)
CLOCKEM Same as CLOADEM except that the EM module will set the
least significant bit of the word in memory to ONE after
fetching the word sent to the coordinator.
CLOADEMN(L) Fetch N words from EM; store to coordinator memory.
Memory address is in instruction. EM address is in Reg2,
EM module no. is in Reg3. N is Reg1 or literal.
CSTOREM Store Reg1 to EM. EM address is in Reg2, module no.
in Reg3.
CSTOREMN(L) Store N words from memory to EM. EM address is in Reg2,
module no. in Reg3. N is Reg1 or literal (actually,
countfield).
CINT(L) Interrupt register AND operand → Reg 3. Interrupt
register AND NOT operand → interrupt register. Operand is
Reg 2 or literal.
CSMASK(L) Interrupt mask register OR operand → interrupt mask
register.
CRMASK(L) Interrupt mask register AND operand → interrupt mask
register. Any interrupt bit so unmasked causes interrupt
when ONE.
(Note: The instructions in the coordinator up to this
point represent functionally a subset of the processor
capabilities. One possible implementation of them would
be to use a copy of the processor as most of the coordina-
tor. We believe that the coordinator needs 32-bit integer,
and needs more integer registers, too often for this to be
a good idea. )
(The following instructions represent coordinator capa-
bilities which are not needed in the processor. Indeed,
one of the reasons for having a separate coordinator is
so that these functions need not be replicated 512 times,
once per processor, nor do the processors require the
connectivity to the points (DBM controller, host, etc.)
that these functions imply.)
Processor Cooperation
FETCHEM From EM address in Reg2, and EM module no. in Reg3, cause
the given EM module to cycle, and the result broadcast to
the CN buffer of all processors. Start of instruction will
wait on "All processors ready" and "go" will be issued at
an appropriately delayed time.
SHIFCN(L) Wait for "All processors ready". Send "wraparound" command
to CN level N, where N is found in Reg or literal. Send
"go".
LOOP Wait for "all processors ready". If NOT "any processor
enabled", set the "enable" bit of all processors, and exit
the instruction. If "any processor enabled", issue "go"
and GOTO branchaddress.
SYNC Wait for "all processors ready". Issue "go".
BDCST Wait for "all processors ready". Set CN to "broadcast"
mode, last 48 bits of Reg & next to CN buffers of all
processors. Issue "go".
BDCSTN Wait for "all processors ready". Set CN to "broadcast"
mode, send word fetched from coordinator memory thru CN to
all CN buffers. Memory address has normal address format.
HVST Wait for "all processors ready". Set CN to "harvest"
mode, contents of all CN buffers that are "full" are
combined (ORed is acceptable; the actual formula for
combining is logic designer's option) and transmitted to
the last 48 bits of Reg 1 & next.
PINT If "any processor in interrupt mode", GOTO branchaddress.
Actions Imposed on Processors
UBDCST Send N words to processor. N in Reg I. Words taken from
successive addresses in coordinator memory starting at
address given in instruction. (Processor will have
previously been put into a waiting or NOT "enable"d state,
and its MAR loaded with the starting address in PM.)
UBDCSTE Same, except acceptance of data is conditional on "enable"
bit of processor.
USETP Send contents of Reg 1 to processor register whose
address is in Reg 2. (Used for initializing PCR, setting
MAR, as well as for transmitting ordinary data.)
USETPE Same except conditional on "enable" bit of processor.
USETPO Same as USETP except that "enable" bit of all processors is
turned on at end of instruction.
USETPEO Same as USETPO except that acceptance of data is conditional
on previous state of enable bit.
HALTP Reset "enable" of every processor.
STOPP Reset "enable", "I got here", and "busy" of every processor.
Processors will cease executing in mid-instruction.
TESTP If "all processors ready", GOTO branchaddress.
READP Word from processor register addressed in Reg 1 is brought
back to Reg 2 and the register following Reg 2.
READPM Word from processor memory (at address set by MAR of
processor) is brought back to Reg 2 and the register next
after Reg 2.
READPMN N words from processor memory, starting at the address
found in this instruction. This and previous two instructions
are conditional on the "enable" bit.
PROC Wait for "All processors ready". Send ADDR command with
contents of Reg 1 as the processor number.
TESTE If "any processor enabled", GOTO branchaddress.
SPARE Change the designation of spare processor, or of spare EM
module. There are four registers designating spare processor,
and four registers designating spare EM module.
These registers are readable with the CMOVE instruction.
SETCN(L) Set CN controls to bit pattern found in register (literal).
This command modifies CN function for diagnostic
purposes, such as restricting access to one or the other
sheet of a two-sheet CN.
TIOM Reg 1 & next transmitted to DBM controller as control word.
STATUS Status word of DBM controller fetched into Reg 1 & next.
(Status will be same as control word, except word count
will be decremented to current state, and a field of status
bits may have been changed by the DBM controller. Format
TBD.)
TIOH Reg 1 & next transmitted to host-readable register.
Interrupt host.
HOST Read host read and writable register into Reg 1 & next.
SCLOCK Transfer Reg to real-time clock. Clock decrements at a
fixed rate, TBD, causing interrupt when it decrements past
zero. Setting the clock resets the interrupt bit, if up.
RCLOCK Transfer contents of real-time clock counter to Reg.
Mnemonics
FADD, FSUB
FADDM, FSUBM
FADDL, FSUBL
FMUL
FMULM
FMULL
FDIV, FDVR
FDIVM, FDVRM
FDIVL, FDVRL
FMAD, FSUB
FMADM, FSUBM
FMADL, FSUBL
FSSQ
FSSQM
FADEX
FADEXL
FMOVEX
FMOVEX5
FABS, FNEG
TABLE C.2
PROCESSOR INSTRUCTIONS
Half Proc.
or Clock
Full Count
Word
% 6
1 9
1 6
½ 9
1 i0
1 9
½ 44
1 47
1 44
½ ii
1 14
1 ii
½ 21
1 24
½ 2
2
½ 2
½ 2
½ 1
Int. F.P. Mere. Min.
Unit Unit Busy CF Buf.
Busy Busy Busy
0-6
0-1 3-9 0-3
0-6
0-9
0-I 3-12 0-3
0-9
0-44
0-1 3-47 0-3
0-44
0-Ii
0-i 3-14 0-3
0-ii
0-21
0-1 3-24 0-3
0-2 0-2
0-2
0-2 0-2
0-2
0-I
TABLE C.2
PROCESSOR INSTRUCTIONS (Cont)
Mnemonics Half Proc. Int. F.P.
or Clock Unit Unit
Full Count Busy Busy
Word
FABSM,
FADDX FSUBX
FADDXM, FSUBXM
FADDXL, FSUBXL
FMHLX
FADDD, FSUBD
FMULD
FL (48-bit, only 1% word format)
FMOVE
FMOVEM
FMOVF_
FPAKM
FPUPKM
FSTORE
FIX, FIXF, FIXC
FIXD
FINFLZ
FIXEX
FMT
1 4
% 7
1 i0 0-i
1 7
% 15
% 7
½ 22
1% 4
% i
1 4 0-i
1 1
1 9 6-7
1 5 0-1
1 3 0-i
½ 4 3-4
% 5 3--s
% i
% 3 0-3
% 7
3-4
0-7
0-10
0-7
0-15
0-7
0-22
1-4
0-1
3-4
0-1
0-6
2-5
0-3
0-3
O-3
0-1
0-1
0-7
Memo
Busy
0-3
0-3
O-3
6-9
0-3
0-3
Mino
CF Buf.
Busy
TABLE C.2
PROCESSOR INSTRUCTIONS (Cont)
Mnemonics Half Proc.
or Clock
Full Count
Word
Int. F.P. Mere. Min.
Unit Unit Busy CF Bur.
Busy Busy Busy
FMTI
FLOAT
FLT, FLE, FGT, FGE (branch)
FLTM, FLEM, FGTM, FGEM (branch)
FLTL, FLEL, FGTL, FGEL (branch)
FEQL (branch)
FLTD, FGTD (branch)
SETFL, SETZ
IADD, ISLe, IADDI, ISUBI
IADDM, ISUBM
IADDL, ISUBL
IMUL
IMLKM
IMULL
IDIV, IMOD
IDIVM, IMODM
IDIVL, IMODL
IMOD521
IADDX, ISUBX
%
%
%
1
1
1
%
%
%
1
1
%
1
1
%
1
1
%
%
5 0-5
3 _i 0-3
2 1-2 0-2
5 _5 95 _3
2 _2 0-2
3 _2 0-3
3 2-3 0-3
1 0-i
1 0-i
4 0-4 O-3
1 _I
9max _9
12max 0-12
9-max 0-9
16max 0-16
19max 0-19
16max 0-16
4 0-4
2 0-2
0-3
0-3
TABLE C.2
PROCESSOR INSTRUCTIONS (Cont)
Mnemonics Half Proc.
or Clock
Full Count
Word
Int. F.P. Mere. Min.
Unit Unit BL_sy CF Buf.
Busy Busy Busy
IADDXM, ISUBXM
IADDXL, ISUSXL
IMULX
IMULv_
IMULXL
IDIVX, Ib_3DX
IDIVXM, IMODXM
IDIVXL, IMODXL
IADDD, ISUBD
ISn(C,S,N) U')
ISH(C,S,N) D(L)
IOR, IAND, IXOR, IIMP
IOI%M, IANDM, IXORM, IIMPM
IORL, IANDL, IXORL, IIMPL
INOT, IMOVE, ITOS
INOTM, IMEVEM
INOTL, IM(_/EL, ITOSL
IDL, IADL
IDMOVE
1 5 0-5
1 2 0-2
½ 17max 0-17
1 20max 0-20
1 17max 0-17
½ 32max 0-32
1 35max 0-35
1 32max 0-32
½ 2 0-2
½ 2 0-2
½ 5 0-5
% i o-1
1 4 0-4
1 1 0-1
% i 0-1
1 4 0-4
1 1 0-1
1 2 0-2
½ 2 0-2
0-3
O-3
0-3
0-3
Mnemonics
TABLE C.2
PROCESSOR INSTRUCTIONS (Cont)
Half Proc.
or Clock
Full Count
Word
Int. F.P. Mem. Min.
Unit Unit Busy CF Buf.
Busy Busy Busy
IDMOVEM
IDMOVEL
IPNO
IPAK3M
IPUK3M
IPAK3F
IUPK3F
ISTORE
IDSTORE
ILT, ILE, IFT, IGE, IBIT (branch)
ILTM, ILEM, IGTM, IGEM (branch)
ILTL, ILEL_ IGTL, IGEL, IBITL
(branch)
IEQ, INE (branch)
EIQM, INEM (branch)
IEQL, INEL (branch)
IDLT, IDGT (branch)
IDEQ, IDNE (branch)
FSTOPam
1 5
1 2
½ 2
% 6
% 6
% 4
% 4
1 4
1 5
1 3
1 6
] 3
1 4
1 7
1 4
1 4
1 6
% 3
0-5
0-2
0-2
0-4
0-6
0-4 3-4
0-4 0-i
0-2
0-3
0-3
0-6
0-3
0-4
0-7
0-4
0-4
O-6
0-3 2-3
0-3
3-6
0-3
1-4
2-5
0-3
0-3
0-9
TABLE C.2
PROCESSOR INSTRUCTIONS (Cont)
Mnemonics Half Proc. Int. F.P.
or Clock Unit Unit
Full Count Busy Busy
Word
IS_ORF24, IDSTOREM % 3 0-3
I3STOREM % 3 0-3
MSTOREM 1 4 0-4
FRREQ, _IFILL % 1
LOAD_4, LOCYd_ ½ 3 0-3
_PaM % 2
IP_M % 2 1-2
IDREM ½ 3 1-3
I3REM % 4 I-4
MREM 1 4 1-2
ITIX, ITIXL, ITIXL (branch) 1 3 0-3
IJUMP (branch) % 2 0-2
ICALL, IREq_RN, PUSH, OPO, IRETI ½ 30
(TBD)
_%IT, S%IOP, HELP ½ 4
ICALLI % i
lINT % 2 0-2
IINTL ! 2 0-2
ISMASK, !RMASK ½ 1 0-i
ISMASKL, IRMASKL 1 1 0-i
i-2
Mere.
Busy
Min.
C2 Bur.
Busy
0-9
0-6
0-3 1-8
0-i
0-12
0-2
0-2
0-3
0-4
1-4 0-4
0-4
COORDINATOR INSTRUCTIONS
Mnemonics Half Coord Arith
or Clock Unit
Full Count Busy
Word
Mem. Min.
Busy CN Buf.
BUsl
1
CADD, (:SUB, CADDI, CSUBI,
CSH(C,S,N) (L), CAND, COR,
CIMP, CXOR, CNOT, CMOVE,
CTOS
% 1 0-1
CADDN, CSUBN, CANDN, COI_N, CIMPN, 1 4 1-4
CXORN, CNOTN, CMOVEN
CADDL, CSUBL, CADL, CDL, CANDL, 1 1 0-1
CORL, CIMPL, CXORL, CN(_,
CM(A_L, CTOSL
CMUL ½ 16max 0-16
CMULN 1 19max 3-19
CMULL 1 16max 0-16
CDIV, CMOD ½ 32max 0-32
CDIVM, CMODM 1 35max 0-35
CDIVL, CMODL 1 32max 0-32
CMOD521 ½ 4 0-4
CSTORE 1 3 0-3
CGT, OGE, CLT, CLE, CBIT (branch) 1 3 0-3
O3TN, CG_N, CLTN, CLI_, CBITN 1 6 0-6
(branch)
CGTL, CGEL, CLTL, CLEL, CBITL 1 3 0-3
(branch)
0-3
0-3
0-3
0-3
0-3
COORDINATOR INSTRUCTIONS (Cont)
Mnemonics Half Coord Arith
or Clock Unit
Full Count Busy
Word
CEQ, CNE (branch) ½
CEQN, CN_ (branch) 1
CEQL, CNEL (branch) 1
CTIX, CTIXL, CTIN& (branch) 1
CHUMP (branch) ½
CCALL, CR_'I_JRN,CPUSH, CPOP, %
CRETI
CLOADEM, CLOCKEM 4
CLOAD_(L) 1
CSTOREM ½
CSTOREMN(L) 1
4
7
4
3
2
30
(_D)
13
9+
12N
3
9N--6
cI_ ½ 2
CINTL 1 2
CSMASK, CRMASK ½ 1
CSMASKL, CRMASKL 1 1
FETCHEM ½ 13
SHIFCN, SHIFCNL ½ 9
LOOP, PINT, TESTP, TESTE (branch) ½ 2
0-4
0-7
0-4
0-3
1-2
Mem. Min.
Busy CN Bur.
Busy
0-3
0-13 0-12
0-(7 13-(9 0-(9
+I2N) +I2N) +I2N)
0-3 0-7
0- 0-9N
(9N-6)
0-2
0-2
0-1
0-1
0-3 0-12
0-3 0-9
0-2
Mnemonics
COORDINATOR INSTRUCTIONS (Cont)
Half Coord Arith
or Clock Unit
Full Count Busy
Word
__m. Min.
Busy CN Bur.
Busy
SYNC
RDCST
HVST
BDCSTN
UBDCST, UBDCSTE
USETP, USETPE, PROC
USETPO, USETPEO
HALTP, STOPP
READP, READPM
READPM_
TIOM, TIO_, STATUS, HOST
SCIDCK, RCIf_K
½ 2 0-2
% 9 0-5
% 9 o-9
1 12 3-8
1 7+6N 0-(7
+6N)
½ 9 0-4
% n 0-4
% 15 0-15
1 15+ 0-( 15
6N +6N)
% 2 o-2
% 3
(TBD)
0-3
APPENDIX D
RELIABILITY, AVAILABILITY, AND MAINTAINABILITY PROGRAMS
From a system engineering viewpoint, the design of a reliable and
maintainable digital computer system encompasses many interdisci-
plinary technical trade-off decisions. This appendix describes
the computer programs called DESIGN and CONFIGURE which have been
developed by the Burroughs Corporation to focus attention on
critical Reliability, Availability, and Maintainability (RAM)
design factors that have been repeatedly observed to dominate the
frequency of abnormal system interruption and the duration of
downtime in fault-tolerant computer systems. In analyzing the RAM
characteristics of the Flow Model Processor (FMP), the DESIGN
Program was used to pinpoint critical factors pertinent to the
failure, repair, and recovery processes of the FMP that require
concentrated design attention as the design progresses. The
CONFIGURE program was used to predict the performance of the
Support Processor and File Management Subsystems.
The following paragraphs describe the DESIGN and CONFIGURE pro-
grams in terms of the computer system models applied to the FMP
and the NASF and the computations performed. Salient theoretical
and practical assumptions associated with the mathematical model
utilized and definitions of all input parameters and computed
results are discussed to aid in understanding the analysis
performed and interpreting computed results. Definitions of terms
used in this appendix are presented in Section D.4.
D.1 COMPUTER SYSTEM MODEL
Traditionally, in mathematical analyses of repairable redundant
systems, it has been common to assume that system failure occurs
due to the depletion of hardware resources when an active hardware
element fails before the previously active redundant hardware
element(s) is repaired. Although this conventional failure and
repair cycle type model has been applied successfully to investi-
gate the hardware availability aspects of certain types of redun-
dant systems, it has been of little practical value in predicting
the operational RAM characteristics of fault-tolerant computer
systems in which hardware elements operate under software control.
In an operational environment, the failure of a computer system to
operate continuously frequently occurs for reasons other than the
depletion of hardware resources due to permanent type failures
which require repair actions. Common causes of computer system
interruption and downtime include intermittent failures and the
inability to automatically recover from certain single critical
hardware failures. Since an accurate reliability estimate must
take into account all applicable sources of system interruption
and downtime, to the extent possible, Programs DESIGN and
CONFIGURE have been developed to treat overall computer system
behavior in terms of hardware subsystems operation under software
control as depicted in the availability block diagram of Figure
D.1. From a reliability point of view, each of the critical
subsystems in Figure D.1 must operate successfully in order
to sustain proper system operation.
As shown in Figure D.1, any number of independent hardware
subsystems operating under software control can be defined to take
into account as many functions as required. The subsystem model
is based on the premise that if a redundant hardware element
fails, the particular subsystem involved may be interrupted for a
short time to effect reconfiguration. After a short delay, the
subsystem is restored to operation, and continues to operate while
the failed hardware element is being repaired. However, if more
than the specified allowable number of hardware elements are down
for repair or if a critical hardware element has failed, then the
subsystem is down until the appropriate repair has been effected.
As shown in Figure D.1, the failure of any subsystem breaks the
critical success path, causing the system to fail.
[Figure D.1. Computer System Availability Block Diagram (hardware
operating under software control). Subsystem 1 through Subsystem M
appear in series; each contains hardware elements 1 through N and a
source-of-interruption block, with requirements R1/N1 through RM/NM.]
Computation of Mean Up Time (MUT), Mean Down Time (MDT), and
Availability for each of the specified critical elements and
subsystems is performed using the mathematical model discussed in
the following paragraph. System MUT, MDT, and Availability are
then computed based on the successful operation of all subsystems
using conventional methods. The assumption associated with this
system decomposition technique requires only that the system be
composed of independent subsystems, each of which can be regarded
as having two possible outputs (working and failed) and that it is
possible to identify a certain set of subsystem states as "working
states" and the remaining states as "failed states".
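As a minimal sketch of the system-level roll-up described above, the Python below combines per-subsystem MUT and MDT into system figures. The text says only that "conventional methods" are used, so the formulas here are an assumption: availability is taken as MUT / (MUT + MDT), independent subsystems in series multiply their availabilities, and the system failure rate is the sum of the subsystem rates 1/MUT. The function names and the sample numbers are hypothetical.

```python
def availability(mut, mdt):
    """Steady-state availability of a two-output (working/failed) unit."""
    return mut / (mut + mdt)


def series_system(subsystems):
    """Combine (MUT, MDT) pairs for independent subsystems that must all work.

    Returns (system MUT, system MDT, system availability), in the same
    time unit as the inputs.
    """
    a_sys = 1.0
    failure_rate = 0.0
    for mut, mdt in subsystems:
        a_sys *= availability(mut, mdt)   # all subsystems must be up
        failure_rate += 1.0 / mut         # any subsystem failure interrupts the system
    mut_sys = 1.0 / failure_rate
    mdt_sys = mut_sys * (1.0 - a_sys) / a_sys
    return mut_sys, mdt_sys, a_sys


# Hypothetical example: two subsystems, each up 99 hours and down 1 hour on average.
mut_sys, mdt_sys, a_sys = series_system([(99.0, 1.0), (99.0, 1.0)])
print(mut_sys, mdt_sys, a_sys)
```

The system MDT is derived from the other two quantities so that the three results stay mutually consistent.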
D.2 MATHEMATICAL MODEL FOR THE DESIGN PROGRAM
The mathematical model employed in the DESIGN Program is a dis-
crete-state continuous-time model called a Markov process. As
with any type of Markov model, the underlying assumption of this
process is that the transition probability Pij from any state i to
any state j depends only on the states of i and j and is
completely independent of all past states except the last one.
The transition probabilities must obey the following two rules:
- The probability of a transition in time Δt from one state
to another is given by Z(t) Δt, where Z(t) is the hazard
associated with the two states in question. If all Zi(t)'s
are constant, as assumed herein, the model is called
homogeneous.
- The probabilities of more than one transition in time Δt
are infinitesimals of higher order and can be neglected.
These properties and assumptions are quite widely accepted as
being appropriate to modeling the failure and repair cycles of
computer systems.
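The two rules can be illustrated with the smallest possible homogeneous model: one unit that is either up or down, with constant hazards. The sketch below is not part of the DESIGN Program; the function name, the rates, and the step size are hypothetical. Each step applies a transition probability of (rate × Δt) and ignores multiple transitions within a step, exactly as the rules state.

```python
def two_state_up_probability(lam, mu, dt, steps, p_up=1.0):
    """Probability of being up after `steps` intervals of length dt.

    lam: failure hazard (up -> down), mu: repair hazard (down -> up).
    """
    for _ in range(steps):
        p_down = 1.0 - p_up
        # leave "up" with probability lam*dt, return from "down" with mu*dt
        p_up = p_up - (lam * dt) * p_up + (mu * dt) * p_down
    return p_up


lam, mu = 0.01, 0.5          # hypothetical rates per hour
p = two_state_up_probability(lam, mu, dt=0.01, steps=10_000)
print(p, mu / (lam + mu))    # iterated value and the closed-form steady state
```

With constant hazards the iteration settles to mu / (lam + mu), the familiar steady-state availability of a single repairable unit.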
D.2.1 Markov Graphs
Figure D.2 is a Markov graph depicting the transitions between
states for each of the subsystems defined in the Computer System
Availability Block Diagram of Figure D.1. In Figure D.2, shaded
states represent subsystem failure, and consequently system fail-
ure since all subsystems are required to be functioning properly
to achieve system success.
For hardware subsystems operating under software control, the
Markov Graph is quite complex. Therefore, the simplified Markov
Graph shown in Figure D.3 will be described first as an
introduction to considering the chain pertaining to the depletion of
redundancy shown in Figure D.2.
[Figure D.2. Markov Graph for Hardware Operating Under Software Control]
D.2.2 Conventional Failure and Repair Cycle Model
The Markov Graph shown in Figure D.3 is typical of conventional
failure and repair cycle models for repairable redundant systems
where the mechanism for removing failed hardware elements from the
system and replacing repaired hardware elements are tacitly as-
sumed to be perfect. As shown in Figure D.3 the number of states
in the Markov Graph is a variable since each subsystem may contain
0, 1, 2, 3 or more active redundant hardware devices. If, for
instance, a subsystem contains two active identical devices (N),
only one of which is required to be operating for subsystem
success (R), State L+1 becomes State 3 (a DOWN state) which
terminates the chain since L+1 = N-R+2, or L = N-R+1.
For the subsystem with one redundant device, it is common to
hypothesize that at least two device failures must occur before a
subsystem failure can occur. Normally, the Mean Time to Repair
(MTTR) of a device is very short compared with the Mean Time
Between Failures (MTBF); therefore, many allowable device failures
are expected to occur before a subsystem failure occurs due to a
second device failure during the time when a failed device is
being repaired. Thus, on the surface it appears that tremendous
gains are in store if sufficient redundancy is provided for
critical devices in each subsystem, since a second
failure during a repair cycle is a rare event.
[Figure D.3. Simplified Markov Graph for Depletion of Redundancy.
States are arranged left to right by number of failed devices;
failure transitions are labeled Nλ, (N-1)λ, (N-2)λ, ..., (N-L+1)λ.]
R: number of devices required to be operating for success
N: number of devices available
λ: device failure rate
μ: device repair rate
As shown in Figure D.3, "N" subsystem devices are operating
successfully in State 1. Therefore, the rate at which device
failures occur is "N" times the failure rate (λ) of a single device,
since all devices are required to be identical. In State 2 one
device is failed and either of the following two transitions may
occur:
- The failed device may be repaired, at rate μ, before a
second failure occurs and placed back into service,
returning the subsystem to State 1.
- A second failure may occur before the repair is complete,
further degrading the subsystem to State 3.
In State 2, failures occur at rate (N-1) times λ, since one device
is already being repaired. The rate at which failures occur in
subsequent states diminishes as shown in Figure D.3 until the
subsystem contains an inadequate number of hardware devices to
sustain acceptable functional operation. Once the subsystem is in
a failed state, it is assumed that operations cease and no addit-
ional failures occur.
Since program DESIGN is intended to investigate system design
potential, an ideal support environment is assumed in which
replacement spares, trained repairmen, documentation, test equip-
ment, etc., are all immediately available when required. As shown
in Figure D.3 the rate at which repairs are enacted when one
device is failed is μ, and the rate at which repairs are enacted
when more than one device is failed is 2μ. The underlying
assumptions for the coefficients of μ are that only one repairman
will be assigned to a failed device and that the maximum number of
repairmen available for assignment to a failed subsystem is two.
Thus, if one hardware device fails, one repairman goes to work;
only when two or more devices require repair are both available
repairmen busy. Normally, the probabilities associated with
degradation to states where more than one or two devices require
service simultaneously are very small.
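The simplified chain of Figure D.3 is a birth-death process, so its steady-state probabilities follow from a simple product of rate ratios. The sketch below assumes exactly what the text states: failures occur at rate (N-k)λ when k devices are failed, repairs proceed at μ with one device failed and 2μ with more than one, and no further failures occur once the subsystem is down. The function name and the example rates are hypothetical, and this is not the DESIGN Program's own algorithm.

```python
def depletion_chain(n, r, lam, mu):
    """Steady-state probabilities of 0, 1, ..., N-R+1 failed devices.

    n: devices available, r: devices required, lam: device failure rate,
    mu: device repair rate. Returns (availability, probabilities).
    """
    down = n - r + 1                      # failed-device count that brings the subsystem down
    weights = [1.0]                       # unnormalized weight of "0 failed devices"
    for k in range(down):                 # balance the k <-> k+1 transition pair
        fail_rate = (n - k) * lam         # N*lam, (N-1)*lam, ...
        repair_rate = mu if k + 1 == 1 else 2 * mu   # one repairman, then at most two
        weights.append(weights[-1] * fail_rate / repair_rate)
    total = sum(weights)
    probs = [w / total for w in weights]
    return 1.0 - probs[-1], probs         # only the last state is a DOWN state


# Hypothetical example: duplexed device, MTBF 1000 h, MTTR 2 h.
avail, probs = depletion_chain(n=2, r=1, lam=1 / 1000, mu=1 / 2)
print(avail, probs)
```

For the duplexed example the probability of the DOWN state is on the order of (λ/μ)², which is the "tremendous gain" that the conventional model predicts and that the following section qualifies.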
D.2.3 Design Model for Hardware Elements Operating Under Software
Control
There are several critical factors involved in adding redundant
devices in computer subsystems which tend to severely reduce the
potential benefits of hardware redundancy. First, the mechanism
for automatically detecting, isolating, and switching failed
devices out of the system and adding repaired devices back into
the system is a complex interdisciplinary design problem. Also,
clocks, controllers, busses and interface circuitry between hard-
ware devices tend to contain Single Point Failure Modes (SPFM's)
which cause subsystem failures even though the subsystem is not
depleted of sufficient hardware resources. Unlike some types of
systems, both permanent type failures which require a repair
action and intermittent type failures which disappear before being
isolated must carefully be considered in designing computer
systems since either can cause abnormal subsystem interruption.
For continuous operation, the problem of performing scheduled
maintenance actions becomes an important consideration, and safe-
guards are necessary to prevent accidental system interruption
when unscheduled maintenance actions are being performed on failed
devices which cannot be physically disconnected from the system.
Referring to the Markov Graph for hardware operating under soft-
ware control in Figure D.2, it can be seen that the center portion
labeled "Depletion of Redundancy" corresponds closely to the
previously discussed simplified Markov Graph. As before, the
number of states is a variable depending upon how many redundant
devices are provided in the subsystem. Failure states for perman-
ent and intermittent type failures related to the recovery process
and SPFM's are organized in line with the labels on the right-hand
side of Figure D.2. State 5L+I in Figure D.2 provides for
considering scheduled maintenance actions in systems where
continuous operation is desired, and consideration of maintenance
errors during unscheduled maintenance actions is factored into
states where repair actions are being performed while the system
is still operating.
Considering first only the depletion of redundancy, the Markov
Graph for hardware subsystems operating under software control is
based on the premise that if a redundant hardware device fails,
the particular subsystem involved may be interrupted for a negli-
gible time to effect automatic reconfiguration. After automatic-
ally decommitting the failed device, the subsystem is immediately
restored to operation, and the failed hardware device is then
repaired. When repair of the failed hardware device is completed,
it is recommitted to the subsystem without any discernable inter-
ruption in subsystem service. However, if more than the specified
allowable number of hardware devices are down for repair, or if a
critical hardware device has failed, then the subsystem is down
until the appropriate repair has been effected.
As previously discussed, the five primary sources of system
interruption and downtime diagrammed in Figure D.1 for hardware
operating under software control are:
- Depletion of Adequate Resources
- Unsuccessful Recovery (Intermittent)
- Unsuccessful Recovery (Permanent)
- Single Point Failure Modes (Intermittent)
- Single Point Failure Modes (Permanent)
Starting in State 1, "N" identical, independent subsystem devices
are operating successfully. Therefore, the rate at which failures
occur (either permanent, λp, or intermittent, λI) is N times the
failure rate of a single device. If no redundancy is provided,
any failure causes a subsystem failure. With redundancy, when a
failure occurs in State 1, any of the following transitions may
occur.
- If the failure is a permanent type failure related to an
SPFM, no recovery is possible and there is a transition to
State 2L+I, which requires a repair action to reestablish
subsystem operation. The rate at which SPFM repairs are
enacted is_p C. When the repair is completed in State
2L+I, there is a transition back to State I, where the
subsystem is again operating successfully with all hardware
devices present.
- If the failure is an intermittent type failure related to an SPFM,
no recovery is possible and there is a transition to State
4L+I. Since intermittent failures do not require a repair
action, the device is returned to the subsystem and there
is a transition from State 4L+I back to State 1 at a rate
, which is the device manual recovery rate including the
ime required for the intermittent failure to disappear.
- If the failure is a permanent type failure and the
automatic recovery system is successful, there is a transition
to State 2, in which case the subsystem continues to
operate with one device decommitted from the subsystem. If
no additional events occur before the failed device is
repaired and recommitted to the subsystem, there is a
transition back to State 1. These transitions occur at rate μ_P,
which is the device repair rate for permanent type
failures. Additional events in State 2 will be discussed
subsequently.
- If the failure is a permanent type failure and the
automatic recovery system is unsuccessful, there is a
transition to State L+2. In State L+2, manual recovery
procedures are enacted at rate γ_D and there is a transition
to State 2, where the subsystem is operating with one device
decommitted from the subsystem. As indicated above, State
2 will be discussed subsequently.
- If the failure is an intermittent type failure and the
automatic recovery system is unsuccessful, there is a
transition to State 3L+1. Again, since intermittent type
failures do not require a repair action, the device is
returned to the subsystem and there is a transition back to
State 1 at rate γ_D.
- For systems which are required to operate continuously,
State 1 provides the best opportunity for performing any
required scheduled maintenance on subsystem hardware devices.
Therefore, scheduled maintenance is restricted to being
performed only when all subsystem hardware devices are
operating successfully. When a hardware device is decommitted
for scheduled maintenance, there is a transition to State
5L+1, where the subsystem is operating successfully but is
depleted of one of its hardware resources. If the scheduled
maintenance is completed (rate μ_M) and the device is returned
to the subsystem before an event occurs, there is a transition
back to State 1. However, if an event occurs before the scheduled
maintenance action is completed, any of the following may
occur:
- To State 2L+2 if the event is related to a permanent
type SPFM (Subsystem DOWN).
- To State L+3 if the event is related to a permanent type
failure and automatic recovery is unsuccessful
(Subsystem DOWN).
- To State 3 if the event is related to a permanent type
failure and automatic recovery is successful (Subsystem
UP), provided the subsystem contains two or more
redundant devices; otherwise, State 3 is a DOWN state.
In State 2, subsystems with one or more redundant devices are
operating successfully with one failed device decommitted from the
subsystem. Therefore, the rate at which events related to the
number of hardware devices occur diminishes to a multiplier of
N-1. The subsystem operates essentially as described for State 1
except that the failed device being repaired may be a hazard to
subsystem operation. If the failed device is not disconnected
from the subsystem and safeguards are inadequate, a maintenance
error could occur which brings the subsystem down. The transition
from State 2 to State L+2 accounts for this potential mode. As
shown, catastrophic maintenance errors occur at a rate equal to
the reciprocal of the mean time between maintenance errors (MTBME).
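The way the total failure rate in State 1 divides among the transitions listed above can be sketched in a few lines of Python. This is an illustration only: the multiplicative split by single point failure fraction and recovery efficiency is an assumption (Figure D.2 would be the authority), the class and function names are hypothetical, and the scheduled maintenance transition to State 5L+1 is left out.

```python
from dataclasses import dataclass


@dataclass
class SubsystemInputs:
    n_devices: int            # N, devices available
    mtbf_permanent: float     # MTBF(P) in hours
    mtbf_intermittent: float  # MTBF(I) in hours
    spfm_fraction: float      # single point failures, as a fraction of all failures
    re_permanent: float       # RE(P), as a fraction
    re_intermittent: float    # RE(I), as a fraction


def state1_exit_rates(x: SubsystemInputs) -> dict:
    """Rates (per hour) of the transitions out of State 1, keyed by destination.

    Assumption: each failure is independently classified as SPFM or not,
    and automatic recovery is attempted only for non-SPFM failures.
    """
    lam_p = x.n_devices / x.mtbf_permanent     # N times the permanent failure rate
    lam_i = x.n_devices / x.mtbf_intermittent  # N times the intermittent failure rate
    not_spfm = 1.0 - x.spfm_fraction
    return {
        "2L+1": lam_p * x.spfm_fraction,                       # permanent SPFM
        "4L+1": lam_i * x.spfm_fraction,                       # intermittent SPFM
        "2": lam_p * not_spfm * x.re_permanent,                # permanent, recovery succeeds
        "L+2": lam_p * not_spfm * (1.0 - x.re_permanent),      # permanent, recovery fails
        "3L+1": lam_i * not_spfm * (1.0 - x.re_intermittent),  # intermittent, recovery fails
    }


# Hypothetical subsystem: 4 devices, MTBF(P) = 1000 h, MTBF(I) = 500 h
example = SubsystemInputs(4, 1000.0, 500.0, 0.05, 0.90, 0.80)
rates = state1_exit_rates(example)
for state, rate in rates.items():
    print(f"State 1 -> State {state}: {rate:.6f} per hour")
```

Under these assumptions, intermittent failures that are recovered automatically leave the subsystem in State 1, so the listed rates sum to slightly less than N times the combined single-device failure rate.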
D.3 MATHEMATICAL MODEL FOR THE CONFIGURE PROGRAM
In contrast to the traditional two-state failure and repair cycle
reliability model, the CONFIGURE program employs the three-state
model shown in Figure D.4 which enables the effects of manual
recovery from non-permanent failures and errors to be taken into
consideration. This separation of repair and nonrepair events is
the key to modeling the effects of intermittent failures, software
errors, maintenance errors, unisolated events, and unsuccessful
automatic recoveries in close approximation to the physical system
being analyzed.
As shown in Figure D.4, when an element fails, a transition occurs
from the UP state into either the REPAIR state or the INTERRUPT
state. The rate at which these transitions occur is the
reciprocal of the element Mean Up Time (1/MUT). Variable F1 defines
the fraction of failures or errors which cause a transition
directly into the INTERRUPT state. Hence, (1-F1) is the fraction of
failures that cause a transition directly into the REPAIR state.
Once an element is in the REPAIR state, the only possible transi-
tion is to the UP state. The rate at which this transition occurs
is, of course, the reciprocal of the element Mean Repair Time
(1/MRT). When an element is in the INTERRUPT state, transitions
to either the REPAIR state or the UP state can occur. Variable F2
is the fraction of total interrupt events which go into the REPAIR
state rather than going directly into the UP state, and the reci-
procal of the Mean Interrupt Time (1/MINT) is the rate at which
these transitions occur.
For hardware subsystems, the subsystem is considered to be
operating successfully if every element is in the UP state, and the
subsystem is considered to be down if any element causes a
transition to the INTERRUPT state. The subsystem can be operating
successfully with some of the hardware elements in the REPAIR
state. This depends on the definition of how many hardware
elements of the subsystem can be in the REPAIR state with the
subsystem still capable of performing its intended function. For
critical subsystems, the subsystem is considered to be operating
successfully only when no repair action or interruption is in
process. Thus, for unisolated events, operator errors, and
maintenance errors which require no repair action, transitions from
the UP state go directly into the INTERRUPT state. In this case,
restoration of system operation is accomplished by a manual
recovery action. Software errors which disappear follow this same
pattern. However, if a software patch is required, the REPAIR
state becomes involved in a manner analogous to the situation
discussed with respect to permanent type hardware failures.
The summary table provided below the transition diagram outlines
conditions in states S1, S2, and S3, and defines the type of
recovery required for the specified conditions.
The critical assumptions associated with the derivation of the
state probability equations shown in Figure D.4 are:
- Failure and repair hazards are assumed to be constant,
which is equivalent to stating that individual elements are
assumed to fail in accordance with the negative exponential
distribution, and the times to repair are also exponentially
distributed.
- Each subsystem element is completely independent of other
elements in the subsystem.
- Each subsystem element is identical in any given
subsystem.

Figure D.4. Three-State Model -- CONFIGURE Program

TRANSITION DIAGRAM (transition rates)

UP to INTERRUPT        F1/MUT
UP to REPAIR           (1 - F1)/MUT
INTERRUPT to REPAIR    F2/MINT
INTERRUPT to UP        (1 - F2)/MINT
REPAIR to UP           1/MRT

STATE   SUBSYSTEM   ELEMENTS   RECOVERY
S1      UP          N UP       NONE
S2      UP          UP >= R    AUTOMATIC
        DOWN        UP < R     MANUAL
S3      DOWN        DOWN       MANUAL

State probabilities:

$$P_U = \frac{1}{1 + F_1\,\frac{MINT}{MUT} + \left[(1 - F_1) + F_1 F_2\right]\frac{MRT}{MUT}}$$

$$P_I = F_1\,\frac{MINT}{MUT}\,P_U$$

$$P_R = \left[(1 - F_1) + F_1 F_2\right]\frac{MRT}{MUT}\,P_U$$

where

P_U  = PROBABILITY OF BEING IN THE UP STATE
P_I  = PROBABILITY OF BEING IN THE INTERRUPT STATE
P_R  = PROBABILITY OF BEING IN THE REPAIR STATE
MUT  = MEAN UPTIME = TOTAL TIME / NE
MRT  = MEAN REPAIR TIME = TOTAL REPAIR TIME / NRE
MINT = MEAN INTERRUPT TIME = TOTAL SYSTEM INTERRUPT TIME / NIE
F1   = FRACTION OF TOTAL FAILURES WHICH CAUSE AN INTERRUPT
F2   = FRACTION OF TOTAL INTERRUPTS WHICH REQUIRE A REPAIR ACTION
NE   = NUMBER OF OBSERVED EVENTS
NRE  = NUMBER OF REPAIR EVENTS
NIE  = NUMBER OF INTERRUPT EVENTS
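The state probability equations of Figure D.4 follow from balancing the flow into and out of each state under the constant-hazard assumption. The short Python sketch below evaluates them and checks the balance numerically; the function names and the numeric inputs are hypothetical and serve only as an illustration.

```python
def three_state_probabilities(mut, mrt, mint, f1, f2):
    """Return (P_U, P_I, P_R) for the three-state model of Figure D.4."""
    repair_share = (1.0 - f1) + f1 * f2  # fraction of events that end in REPAIR
    p_up = 1.0 / (1.0 + f1 * mint / mut + repair_share * mrt / mut)
    p_interrupt = f1 * (mint / mut) * p_up
    p_repair = repair_share * (mrt / mut) * p_up
    return p_up, p_interrupt, p_repair


def balance_residuals(mut, mrt, mint, f1, f2):
    """Flow in minus flow out for the INTERRUPT and REPAIR states (both ~0)."""
    p_up, p_int, p_rep = three_state_probabilities(mut, mrt, mint, f1, f2)
    interrupt = p_up * f1 / mut - p_int / mint
    repair = p_up * (1.0 - f1) / mut + p_int * f2 / mint - p_rep / mrt
    return interrupt, repair


# Hypothetical element: MUT = 500 h, MRT = 2 h, MINT = 0.25 h, F1 = 0.6, F2 = 0.1
p_u, p_i, p_r = three_state_probabilities(500.0, 2.0, 0.25, 0.6, 0.1)
print(f"P_U = {p_u:.6f}, P_I = {p_i:.6f}, P_R = {p_r:.6f}")
print("balance residuals:", balance_residuals(500.0, 2.0, 0.25, 0.6, 0.1))
```

The three probabilities sum to one by construction, and the residuals vanish because the closed-form expressions are the steady-state solution of the transition diagram.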
D.4 DEFINITIONS
The definitions summarized below are provided for reference to aid
in interpreting computed results.
D.4.1 Program Inputs
The following definitions pertain to Program inputs. Each defi-
nition describes a program variable. The symbol used is given in
brackets following the definitions. In cases where these symbols
are reciprocals of the various transition rates discussed, an
equivalence relationship is given which correlates the symbols in
Figure D.2 with the input data symbols.
- DEVICES REQUIRED. Minimum number of identical subsystem
devices required to be working for acceptable subsystem
operation (R).
- DEVICES AVAILABLE. Number of identical subsystem devices
provisioned for active subsystem operation (N).
- TIME BETWEEN FAILURES (PERMANENT). Time interval from an
instant when a repairable device is working to the next
permanent type device failure which requires a repair
action (mean: MTBF(P) = 1/λ_P).
- TIME BETWEEN FAILURES (INTERMITTENT). Time interval from
an instant when a repairable device is working to the next
intermittent type device failure which requires a manual
recovery action (mean: MTBF(I) = 1/λ_I).
- SINGLE POINT FAILURES*. Percentage of total failures in a
redundant subsystem configuration which result in subsystem
failures (permanent or intermittent) even though an
adequate number of devices are working (SPFM = P).
- DEVICE REPAIR TIME. Time interval from an instant when
repair of a device is initiated to readiness as an active
subsystem device, excluding waiting times for repairmen,
spares, etc. (mean: DRT = 1/μ_P).
- SINGLE POINT REPAIR TIME*. Time interval from an instant
when repair of a single point failure is initiated to
readiness of associated subsystem, excluding waiting times
for repairmen, spares, etc. (mean: SRT = 1/μ_PC).
- RECOVERY EFFICIENCY (PERMANENT). Percentage of automatic
recovery actions from permanent type failures which are
completed successfully (without manual intervention) within
a negligible period of time (RE(P)).
- RECOVERY EFFICIENCY (INTERMITTENT). Percentage of
automatic recovery actions from intermittent type failures
which are completed successfully (without manual
intervention) within a negligible period of time (RE(I)).
- DEVICE MANUAL RECOVERY TIME. Time interval from an instant
when a system failure related to an unsuccessful automatic
recovery from a device failure (permanent or intermittent)
occurs until the system is restored to normal operation via
manual recovery procedures (mean: DMRT = 1/γ_D).
- TIME BETWEEN MAINTENANCE ERRORS. Time interval from an
instant when a system recovery related to a maintenance
error is completed until the next occurrence of a
maintenance error which causes system interruption (mean: MTBME).
- TIME BETWEEN PREVENTIVE MAINTENANCE ACTIONS. Time
interval from an instant when a device has been recommitted to
active operation following a scheduled preventive
maintenance action until the next scheduled preventive maintenance
action is due (mean: MTBPM).
- TIME TO PERFORM PREVENTIVE MAINTENANCE. Time interval from
an instant when a device is decommitted from active
operation for scheduled preventive maintenance until it is
recommitted (mean: MTTPM = 1/μ_M).
- SYSTEM MANUAL RECOVERY TIME. Time interval from an instant
when a system failure related to a transient software or
operator error occurs until the system is restored to
normal operation via manual recovery procedures (mean:
SMRT = 1/γ_S).
* This variable is provided for convenience in preparing initial
estimates; if desired, SPFM's can be modeled as a single
nonredundant device in series with the associated subsystem.
D.4.2 Program Outputs
The following six useful measures of system performance are
frequently encountered in modeling and analyzing repairable,
redundant system configurations:
- Availability
- Mean Up Time (MUT)
- Mean Down Time (MDT)
- Mean Cycle Time (MCT)
- Mean Time to First Failure (MTFF)
- Mean Time to Failure (MTTF)
Although the terminology given above is common in the field of
reliability, the precise meaning of the terms can easily become
confused. The following definitions of these terms appear in the
paper by Buzacott [1]. These definitions have been extracted
almost directly since the unified presentation in the Buzacott
paper tends to relieve much of the misunderstanding encountered
regarding the various mean times of interest and the concept of
point and interval availability in treating repairable, redundant
systems. Only the MUT, MDT, and interval availability definitions
are directly applicable to the outputs of the DESIGN program. The
remaining definitions are provided for reference to further clar-
ify the precise meanings of MUT, MDT, and interval availability.
D.4.2.1 System Availability
Let SYSTEM INITIATION be the instant when system operation begins
for the first time. The system and all of its components are
assumed to be working correctly and not to be subject to wearout
during the time interval of interest.
Let SYSTEM FAILURE be the instant when the system changes from
working to failed. Let SYSTEM REPAIR be the instant when the
system changes from failed to working.
The first group of definitions applies to the concept of
availability.
- POINT AVAILABILITY (at time t). The probability that
the system is working at time t from system initiation. It is
assumed that at time t no information is available about system
failures and system repairs during the time interval (0, t)
(symbol: Pw(t)).
- INTERVAL AVAILABILITY (at time t). The expected
proportion of the time interval from system initiation (time 0) to
time t during which the system is working (symbol: I(t)). Hence

$$I(t) = \frac{1}{t}\int_0^t P_w(u)\,du \qquad \text{(D.1)}$$
It is assumed in calculating all quantities except the mean of the
time to first failure that the system has been operating long
enough for initial effects to have died away; i.e., statistical
equilibrium (or the asymptotic behavior) has been reached. The
mean cycle time is then obtained from the ratio of the length of
some long time interval to the number of failures N in that time
interval, i.e.,

$$MCT = \lim_{t \to \infty} \frac{t}{N}$$

and

$$AV = \text{Availability} = \frac{MUT}{MCT}$$
Asymptotically, for large t, it can be shown that point
availability is numerically the same as interval availability. Thus

$$A = \lim_{t \to \infty} P_w(t) = \lim_{t \to \infty} I(t)$$

This steady-state availability is the availability calculated
herein.
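The convergence of point and interval availability to the same steady-state value can be checked numerically. The sketch below assumes the simplest possible case, a single two-state (working/failed) system with exponentially distributed up and down times that starts in the working state; this closed form for Pw(t) is a textbook result for that case and is not taken from the CONFIGURE program. The function names and numbers are hypothetical.

```python
import math


def point_availability(t, mut, mdt):
    """Pw(t) for a two-state exponential system that is working at t = 0."""
    lam, mu = 1.0 / mut, 1.0 / mdt       # failure and repair rates
    steady = mu / (lam + mu)             # equals MUT / (MUT + MDT)
    return steady + (1.0 - steady) * math.exp(-(lam + mu) * t)


def interval_availability(t, mut, mdt, steps=20000):
    """I(t) = (1/t) * integral of Pw(u) over (0, t), by the trapezoidal rule."""
    h = t / steps
    total = 0.5 * (point_availability(0.0, mut, mdt) + point_availability(t, mut, mdt))
    for k in range(1, steps):
        total += point_availability(k * h, mut, mdt)
    return total * h / t


mut, mdt = 100.0, 5.0                    # hypothetical mean up and down times, hours
steady_state = mut / (mut + mdt)
for t in (10.0, 100.0, 10000.0):
    print(t, point_availability(t, mut, mdt), interval_availability(t, mut, mdt))
print("steady state:", steady_state)
```

Because Pw(t) decreases from 1 toward the steady-state value, the interval availability lags behind the point availability and approaches the same limit more slowly.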
D.4.2.2 System Time Interval Between Failures
The next group of definitions refer to the concepts related to the
time interval between failures. Each definition defines a random
variable. The symbol that will be used herein for the mean is
given in parentheses following the definition.
- UP TIME. Time interval from system repair to next system
failure (mean: MUT).
- DOWN TIME. Time interval from system failure to next
system repair (mean: MDT).
- CYCLE TIME. Time interval from one system failure to the
next system failure. The cycle time is the sum of an Up
Time and a Down Time (mean: MCT).
- TIME TO FIRST FAILURE. Time interval from system
initiation to the first system failure (mean: MTFF).
The first three time intervals, up time, down time, and cycle time
apply to a system that is alternately working, failed, working,
failed, and so on. The system is said to be repaired when suffic-
ient, but not necessarily all components are repaired so that the
system is working.
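As a small worked illustration of these definitions, the Python sketch below computes MUT, MDT, MCT and the availability AV = MUT/MCT from an alternating record of up and down intervals. The record itself is invented for the example, and the function name is hypothetical.

```python
def mean_times(intervals):
    """Compute (MUT, MDT, MCT, AV) from (state, hours) pairs.

    `intervals` is an alternating working/failed record; each entry is
    ("up", duration) or ("down", duration).
    """
    ups = [hours for state, hours in intervals if state == "up"]
    downs = [hours for state, hours in intervals if state == "down"]
    mut = sum(ups) / len(ups)        # mean up time
    mdt = sum(downs) / len(downs)    # mean down time
    mct = mut + mdt                  # a cycle is one up time plus one down time
    return mut, mdt, mct, mut / mct


# Hypothetical operating record, in hours
record = [("up", 90.0), ("down", 10.0),
          ("up", 110.0), ("down", 6.0),
          ("up", 100.0), ("down", 8.0)]
mut, mdt, mct, av = mean_times(record)
print(f"MUT = {mut} h, MDT = {mdt} h, MCT = {mct} h, AV = {av:.4f}")
```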
APPENDIX D REFERENCES
1. Buzacott, J.A.; "Markov Approach to Finding Failure Times of
Repairable Systems," IEEE Transactions on Reliability, Vol. R-19,
No. 4, November 1970.
APPENDIX E
FMP RELIABILITY DATA BASE
This Appendix presents the predicted results of the MTBF's and
failure rates for the elements of the FMP system (see Figures E.1,
E.2, and E.3). Three sets of results are shown, the difference
being the failure rate and SECDED improvement factors used for the
LSI memory circuits.
The form of these predictions is a hierarchical structure. Each
FMP element is listed as a level 01: the parts constituting the
elements are listed as level 02. For a better defined system, the
number of levels may increase, showing the assemblies that make up
an element as level 02, the subassemblies as level 03 and the
components in the subassemblies as level 04, etc.
For each item listed in an element, a part number and description
are provided. For the FMP, hypothetical part numbers are used.
In the case of the I.C.'s, typical parts in the generic family
assumed have been selected to represent all the I.C.'s used in the
various logic functions.
The quantity of each part in the structured listing is shown. The
quantity is multiplied by the quantity of the encompassing level
item. For example, where a quantity of 2 of an assembly or
element is listed and a quantity of 3 of a particular part in an
assembly or element is required, the quantity listed for this part
will be 6.
Failure rates for each individual part and the total quantity of
that part are shown and expressed in failures per million hours
(FPMH). The aggregate of these failure rates is used to predict the
failure rate of the element and the mean time between failures
(MTBF) in hours. The failure rates used, except for the LSI
memories, have been developed from the guidelines in MIL-HDBK-217B.
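The quantity extension and failure-rate aggregation described above can be sketched as a small recursive roll-up. The structure, part names and failure rates below are hypothetical and are not taken from the FMP data base; the only relationships used are the ones stated in the text (quantities multiply down the hierarchy, part failure rates in FPMH add up, and MTBF in hours is one million divided by the total FPMH).

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    quantity: int              # quantity per one unit of the enclosing item
    fpmh_each: float = 0.0     # failures per million hours (leaf parts only)
    children: list = field(default_factory=list)


def total_fpmh(item, parent_quantity=1):
    """Sum part failure rates, extending each quantity by the enclosing levels."""
    extended_quantity = item.quantity * parent_quantity
    if not item.children:
        return extended_quantity * item.fpmh_each
    return sum(total_fpmh(child, extended_quantity) for child in item.children)


def mtbf_hours(fpmh):
    """Convert a failure rate in FPMH to a mean time between failures in hours."""
    return 1.0e6 / fpmh


# Hypothetical element: 2 assemblies of 3 parts each (extended quantity 6),
# plus one part mounted directly on the element.
element = Item("ELEMENT", 1, children=[
    Item("ASSEMBLY", 2, children=[Item("PART-A", 3, fpmh_each=0.5)]),
    Item("PART-B", 1, fpmh_each=2.0),
])
rate = total_fpmh(element)
print(f"failure rate = {rate} FPMH, MTBF = {mtbf_hours(rate):.0f} hours")
```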
While not used in this study, columns for the spares confidence
level are shown. When used, these data indicate the maximum
number of repair actions (and therefore spare elements) required
at a specific confidence level for specified number of years and
duty cycles.
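One common way to arrive at a figure of this kind is sketched below, under the assumption that repair actions follow a Poisson distribution with a mean equal to the failure rate times the operating hours. The report does not state which method its listings use, so both the method and the function name are illustrative assumptions.

```python
import math


def repairs_at_confidence(fpmh, years, duty_cycle, confidence):
    """Smallest k such that P(repair actions <= k) >= confidence.

    Assumes Poisson-distributed repair actions (an assumption of this sketch).
    """
    operating_hours = years * 8760.0 * duty_cycle
    expected = fpmh * 1.0e-6 * operating_hours
    if expected > 700.0:
        raise ValueError("expected count too large for this simple summation")
    k = 0
    term = math.exp(-expected)   # P(0 repair actions)
    cumulative = term
    while cumulative < confidence:
        k += 1
        term *= expected / k     # P(k) from P(k - 1)
        cumulative += term
    return k


# Hypothetical element: 50 FPMH, one year of continuous operation
for level in (0.50, 0.90, 0.95):
    print(level, repairs_at_confidence(50.0, 1.0, 1.0, level))
```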
APPENDIX F
SYSTEM THROUGHPUT AND UTILIZATION ANALYSIS
F.1 SUMMARY
The study of the feasibility of the NASF would be incomplete if
only the high-performance computational engine, called the Flow
Model Processor (FMP), were considered. The facility must have
sufficient support equipment so that the FMP will not be idle due
to bottlenecks elsewhere in the facility.
Chapter 2 of this report introduced the expected operational
environment. Figure F.1 shows the organization of the facility at
the level considered in this initial study of the facility.
Reference can be made back to Figures 2.1 and 2.2 to understand
the level of this model. In particular, analysis to this point
does not include the structure of the data communications,
processing and terminals local to the users. All of those capabilities
are lumped under the term "Users".
Since draft copies of the system-level operational scenarios were
not available until late in the study, some of the system-level
analysis originally planned has not been completed. The analysis,
described in more detail below, specifically considers the loading
of the Flow Model Processor, the File System and the Support
Processor. The data transfer requirements between each of these
major system components and to the Users are also considered.
The analysis shows that the system proposed during the Preliminary
Study [1, 2] would be inadequate to support the operational
scenarios provided during this feasibility study. In particular, the
support processor would have been a bottleneck as far as
computational capability is concerned, and the data transfer
requirements to and from the support processor system were
underestimated originally. Part of the excess loading of the support
processor system was alleviated near the start of this study when
the decision was made to consider the feasibility of a system where
file management was a function supported by the file system itself
rather than by the Support Processor. This analysis has shown
that support of the major formatting requirements for both
hardcopy printers and, most especially, for Computer-Output to
Microfilm (COM) should be removed from the Support Processor to
either a peripheral-support processor or perhaps to the FMP itself.
F.2 MODEL AND ASSUMPTIONS USED FOR ANALYSIS
Figure F.1 shows the general model used for this analysis. The
analysis performed was an operational-type analysis based on NASF
operational scenarios included in the original NASF Utilization
document [3] as updated during subsequent discussions.
Figure F.1. NASF System Throughput and Analysis Model
(blocks shown: Flow Model Processor, File System, Support Processor
System)
The data presented in the scenario was in terms of job
classifications. The following cases were used to represent the various
types of use encountered during a "typical" NASF day.
1.A Method and Code Development using scaled down problems.
1.B Grid Modification.
2.A Larger code development as well as grid and result array
generation.
2.B Grid Generation.
3. Simpler simulations on a large grid (such as inviscid flows
with boundary layer correction).
4. Typical viscous, steady flow simulations used for design,
resulting in a single solution.
5. Viscous, steady flow simulations requiring several solutions,
such as design optimizations.
6. Unsteady viscous flow simulations for design applications.
7. Large fluid physics research simulations.
These cases correspond to the column headings in Table F.1.
Regardless of case, each user has a sequence of tasks to be per-
formed in order to complete his job. These tasks were generally
identified in four major areas:
A. Simulation Program Input
B. Simulation Input Data Preparation
C. Simulation (execution)
D. Output of Simulation Results
The actual detailed tasks defined in the utilization document [3]
were:
(A. Simulation Program Input)
1. Source Module Generation is the task of inputting
simulation source programs into the system.
2. Source Module Editing is the task of editing source
modules as required by input or compile errors.
3. Source Module Compilation is the task of compiling
source modules of simulation programs into object
programs.
4. Linking is the task of collecting all the object modules
which are required for a simulation and cleaning up
incomplete address binding prior to loading into the
FMP for execution.
(B. Simulation Input Data Preparation)
5. Configuration Generation is the task where surface
coordinate tables are input and surface patches are computed.
The model assumed that, in some cases, the FMP could
compute the surface patch coefficients but that the
processor local to the graphics stations could also
perform this computation.
6. Surface Grid Generation (not separately modelled but
considered part of Task 7).
7. Flow Field Grid Generation is the task which computes the
coordinates of the grid to be used during flow-field
computations on the FMP. The model assumed that this
task would be executed on the FMP when operator
verification of the resulting grid was not necessary.
Otherwise, it would be executed on the Support Processor in
order to have prompt display of results to the operator.
8. Input Gathering is the task which is used to specify the
parameters of a particular simulation run and to begin
staging data to the FMP.
9. FMP Execution is the task which runs a job on the FMP.
10. Preselected Data Display is the task which outputs data
which had been organized during FMP execution to line
printers, to graphics terminals and to microfilm
printers.
11. Interactive Post-Execute Display is the task which
supports the selective extraction of data from the FMP
output files. The data extracted would be requested by
and displayed to the user at a graphics console.
12. Debugging Display is the task of formatting and
displaying (in some appropriate manner) that information saved
by the FMP when a run aborts.
13. Restart Dumps is a subtask of Task 9 and involves taking
a snapshot of the status of a simulation run to be used
as a restart or initialization point on a later run.
Table F.1 summarizes the important data used to evaluate the model
studied.
In addition to the information provided by NASA, it was necessary
to make some assumptions during the course of the analysis. These
assumptions were based in part on experience and in part on judge-
ment. Table F.2 summarizes these assumptions.
TABLE F.2
Significant Assumptions

Value          Assumption
0.2            Fraction of Users who use the Support Processor to do
               data entry & editing
40             Average length (chars) of source statements
50             Average length (chars) of control messages
0.2            Fraction of a module fixed or modified on each edit
4000           Average additional compiler output (characters) over
               and above the source statements per module
8              Number of words of object program per line of source
0.25           Fraction of modules with bugs which are waiting change
               and which will be batched as far as BINDING (linking)
0.1            Fraction of edited program codes which must be
               completely bound or linked (others will be replacement
               bound)
0.5            Fraction of solution parameters (of an earlier run)
               modified to set up the next run
--             Number of characters out of FORMATTER for each word in
               (used to format printouts & COM)
6              Number of 8 bit characters per word (6 x 8 = 48 bits)
1.25 x 10^[?]  Max size of archive (characters)
.20            Fraction of file access from active data
.70            Fraction of file access from long-term data
.10            Fraction of file access from archive data
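To show how assumptions of this kind turn a scenario quantity into data volumes, the sketch below applies a few of the Table F.2 values to a single source module. The module size of 500 lines is invented for the example, the dictionary keys and function name are hypothetical, and treating the "fraction modified on each edit" as a fraction of source characters is an interpretation rather than something the table states.

```python
# Values taken from Table F.2; key names are illustrative only.
TABLE_F2 = {
    "source_statement_chars": 40,         # average length of a source statement
    "extra_compiler_output_chars": 4000,  # per module, beyond the source itself
    "object_words_per_source_line": 8,
    "chars_per_word": 6,                  # 6 x 8 bits = 48-bit word
    "fraction_modified_per_edit": 0.2,
}

FILE_ACCESS_FRACTIONS = {"active": 0.20, "long_term": 0.70, "archive": 0.10}


def module_volumes(source_lines):
    """Data volumes for one source module of the given length (in lines)."""
    a = TABLE_F2
    source_chars = source_lines * a["source_statement_chars"]
    object_words = source_lines * a["object_words_per_source_line"]
    return {
        "source_chars": source_chars,
        "compiler_output_chars": source_chars + a["extra_compiler_output_chars"],
        "object_words": object_words,
        "object_chars": object_words * a["chars_per_word"],
        # Assumed interpretation: characters touched by one editing pass
        "edit_chars": source_chars * a["fraction_modified_per_edit"],
    }


volumes = module_volumes(500)  # hypothetical 500-line module
for name, value in volumes.items():
    print(f"{name}: {value}")
```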
F.3 ANALYSIS
The analysis first defined the sequence of events required to
implement each task. The relationship of system resources during
each task was then charted based on an approach suggested by Prof.
Anatol Holt (Boston University). These charts or diagrams are
intended to separate spatial relationships and temporal
relationships. Figure F.2 shows the resource-relationship charts for
Tasks 1 through 5 and 7, respectively. The interpretation of these
charts is straightforward. The various NASF resources, sometimes
including equipment local to the users, are shown left to right
across the chart. The sequence of events required to complete a
task is represented from the top to the bottom of the chart. For
example, the first chart of Figure F.2 is for Task 1 - Source
Module Generation. The sequence of events shown is:
Create File
Enter Records
Save File
Create File
Save Working File
The first Create File event involves the user, his terminal, data
comm controls, the operating system on the support processor and
working files (in the support processor). Thus, the system
resources which interact to implement each event of the task are
delineated. The second event, Enter Records, involves a bulk move
of data. This type of interaction is shown with the curved
corners. This notation allows the natural flow of data, here from
the user to the working file, to be shown clearly. Each resource
involved in an event commits something of that resource to the
event, whether it be space (storage) or time. These commitments
were the point of the analysis performed. Note that the charts
shown contain more information than was utilized in the analysis
to date. In particular, the processing capabilities and
communications local to a user have not yet been studied.
After the resource-relationship charts were prepared, they were
used as templates to prepare a straightforward program which
would collect the operational scenario data in the manner
described by the charts in order to generate the results. Thus the
charts could be used to identify how many control messages moved
and what data transferred between elements of the model. The
NASA-provided data was used to identify the average frequency of a
task, the amount of data involved, and the processing time
required.
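The idea of using a chart as a template can be sketched as follows. Each event names the resources it touches and whether it is a bulk data move; the tally then accumulates characters handled per resource per day. Only the resource list of the first Create File event and the 50-character control message come from the text; the remaining resource lists, the event subset, the daily frequency and the bulk size are hypothetical, and the actual analysis program is not reproduced here.

```python
from collections import Counter

CONTROL_MESSAGE_CHARS = 50  # Table F.2: average length of control messages

# (event name, resources involved, is it a bulk data move?)
TASK1_EVENTS = [
    ("Create File",
     ["User", "Terminal", "Data Comm Controls", "SP Operating System", "Working Files"],
     False),
    # Resource lists below are assumptions made for this illustration.
    ("Enter Records",
     ["User", "Terminal", "Data Comm Controls", "Working Files"],
     True),
    ("Save Working File",
     ["User", "Terminal", "Data Comm Controls", "SP Operating System", "Working Files"],
     False),
]


def daily_load(events, runs_per_day, bulk_chars_per_run):
    """Characters handled per resource per day for one task template."""
    chars = Counter()
    for _name, resources, is_bulk in events:
        per_run = bulk_chars_per_run if is_bulk else CONTROL_MESSAGE_CHARS
        for resource in resources:
            chars[resource] += per_run * runs_per_day
    return chars


# Hypothetical scenario: 20 runs of Task 1 per day, 20,000 characters entered per run
load = daily_load(TASK1_EVENTS, runs_per_day=20, bulk_chars_per_run=20000)
for resource, total in load.items():
    print(f"{resource}: {total} characters/day")
```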
The program developed to support the analysis allowed
specification of the parameters of the support processor. Since the
number of CPUs needed as part of the support processor system was
of interest, the capabilities of a particular single processor
were defined. The analysis then showed the support processor
loading in terms of that one processor. From that analysis, the
number of processors required could be determined. For example,
if the average load was determined to be 1.6 processor-hours for
each hour, then at least two processors are required.
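The rounding step in this example can be written out as a short sketch. The function name is an assumption made for illustration. The calculation gives only a lower bound, since it ignores multiprocessing overhead and load peaks.

```python
import math


def processors_required(cpu_hours_per_hour: float) -> int:
    """Smallest whole number of processors that covers an average load.

    The load is expressed in units of one reference processor
    (processor-hours per elapsed hour).
    """
    return max(1, math.ceil(cpu_hours_per_hour))
```

For the load quoted above, processors_required(1.6) gives 2.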
Table F.3 summarizes the characteristics of three support
processors considered during the analysis. The processors are
identified as "A", "B", and "C". This form of identification was chosen
since none of the data has been completely verified on any
processor. In general, the processors are characterized as
     A    B7700
     B    B7800
     C    Future Processor
The data concerning editing, compiling, and formatting on
processor A is based on benchmarks run on a mix of FORTRAN programs.
The other values are estimates based on best knowledge and
judgment.
In addition, where it seemed appropriate during the evaluation,
some modifications were made to the NASA-supplied operational
scenarios. In particular, the analysis was performed for some
cases without the COM output (Task 10C) in an attempt to identify
the impact of that large amount of output.
The analyzer currently generates results in terms of daily average
loads. Hand reduction of the results from the analyzer is used to
generate hourly average loads.
F.4 RESULTS
Before considering the results, a WARNING must be stated. The
analysis to date only considers average data rate and processor
time requirements. These averages would only hold under conditions of
optimum system balance and concurrency. Factors which are needed
to predict the peak rates which an eventual design must consider
have not been included.
The analysis considered three major factors of the NASF system
model. First, the data transfer requirements for transfers
between each of the components of the model in Figure F.1 were
determined. Then the amount of file level activity was
determined in order to estimate the processor required to support the
file system. Finally, the amount of Support Processor and FMP
processing time was determined using the assumptions previously
stated.
TABLE F.3
SUPPORT PROCESSOR CHARACTERIZATION

                                                  A        B        C
Edit Time (Sec/Stmt)                             .01      .0067    .001
Compile Time (Sec/Stmt)                          .007     .0047    .001
Compile Overhead (Sec/Module)                    .1       .067     .013
Linking Time (Sec/Object Word)                   .0007    .00047   .0001
Linking Overhead (Sec/Code)                      .01      .0067    .001
Grid Generation Rate (Sec/Grid-Element)          .001     .00067   .0001
Operating System Overhead (Sec/Task)             .1       .067     .0133
Formatting Rate (Sec/Word Formatted)             .0006    .0004    .0001
Output Selection Rate (Sec/Point Selected)       .01      .0067    .0013
F.4.1 Processor Loading
Table F.3 summarizes the basic characteristics of the processors
considered in this analysis. The data in this table concerning
output formatting is based on the normal interpretive execution of
a formatter package which is driven by the FORMAT statements. One
of the major reasons such a system is interpretive is the
possibility of variable format statements. However, most of the
applications observed (both the aero and weather codes, among others) have
rather straightforward, fixed formatting. The improvement in
formatting time that could occur if the formatter were compiled and
executed per the statements given, rather than interpreted, was
assumed to be a factor of 5. Each of the processors of Table F.3
was considered under both the standard scenario and under a
scenario with no COM activity (no Task 10C). In addition, all
cases were studied with both the existing means of formatting and
with the hypothesized improvements. Table F.4 summarizes the
results in terms of support processor loading.
Note that processor "B" seems to be committed to 9.5 CPU hours per
hour when the standard scenario (interpretive formatting and COM
output) is considered. A 10-processor system would satisfy such a
requirement assuming a better than optimum multiprocessing system.
If the COM formatting task is off-loaded, then only .88 CPU hours per
hour are committed.
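The factor-of-5 assumption can be used to estimate how much of a tabulated load is formatting. The sketch below is an illustration only. It assumes that the load splits into a formatting part, which shrinks by the speedup factor, and a remainder that is unchanged between the interpretive and direct-execute cases. The function name is hypothetical, and the inputs are the rounded Table F.4 figures for processor B, so the result is approximate.

```python
def split_load(interpretive, direct_execute, speedup=5.0):
    """Estimate (formatting, other) load in CPU-hours per hour.

    Solves  interpretive   = formatting + other
            direct_execute = formatting / speedup + other
    for the two unknowns.
    """
    formatting = (interpretive - direct_execute) * speedup / (speedup - 1.0)
    other = interpretive - formatting
    return formatting, other


# Processor B with COM output: 9.5 interpretive, 3.3 direct execute.
formatting_b, other_b = split_load(9.5, 3.3)
```

Under these assumptions roughly 7.75 of the 9.5 CPU-hours per hour would be formatting, which is consistent with the statement that off-loading COM formatting removes most of the load.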
Now consider how this load is distributed through the operational
day. Figure F.3 shows a distribution of jobs over the day. This
distribution is slightly simplified from the scenario given. In
this case, 22 hours of operations were assumed. The loading shown
is such that no workload case which represents long jobs overlaps
with a workload case which represents short jobs. When this
distribution of jobs is assumed, the average SPS processor loading
per shift can be determined. Figure F.4 shows this evaluation for
processor B with no COM formatting.
Figure F.4 shows the same schedule of job execution as Figure F.3.
The columns on the right side of the figure show the total CPU
load determined for each case (output of the analyzer). The load
for each case was then averaged over the shift in which it is
scheduled to determine the average CPU loading over the shift.
The loading is shown as CPU-hours per hour. Note that in Table
F.4, .88 CPU-hours per hour is the average load over a day.
However, Figure F.4 shows that when the load is distributed by shift,
the peak load is 1.64 CPU-hours per hour.
Even this rate is optimistic since the actual loading of an
interactive system is not uniformly distributed across the day.
Loading will tend to have peaks and valleys. If there is a
variation of 30% from the average, then the peak rate would be 2.13
processor-hours/hour. A trade-off between response time and
system complexity and cost must now be considered. When limiting
the system to two processors of this sort, these processors would
be busy most of the time, depending on load peaks which are not
under the control of the operations staff.
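The peak estimate above is a one-line calculation, sketched here for reference. The function name is an assumption. The 30% figure is the variation hypothesized in the text, not a measured value.

```python
import math


def peak_load(shift_average, variation):
    """Scale a shift-average load by a fractional variation above average."""
    return shift_average * (1.0 + variation)


# Highest shift average for processor B without COM formatting.
peak = peak_load(1.64, 0.30)
whole_processors = math.ceil(peak)  # processors needed to cover the peak
```

With these inputs the peak rounds to 2.13 processor-hours per hour, so two processors fall slightly short of the peak while three would cover it. This is the trade-off between response time and cost described above.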
Table F.4
Support Processing CPU Hours/Hour
Daily Average

                                        Scenario
PROCESSOR   FORMATTING          With COM    Without COM
    A       Interpretive          14.2         1.31
            Direct Execute         3.7         1.12
    B       Interpretive           9.5          .88
            Direct Execute         3.3          .76
    C       Interpretive           2.8          .19
            Direct Execute          .7          .15
The analysis described above is shown in Figure F.5 for processor
"C" (a future processor). The case shown includes the COM output
load and assumes the improved formatting rate. Unfortunately, the
analysis of loading due to COM formatting is only an approximation.
The actual use planned is to produce graphics images on the COM
device. A sequence of many frames would become a movie showing
the dynamics of a model. Since no information concerning graphics
formatting was available at the time, the approximation was made
that COM formatting would be the same magnitude of task as
formatting alphanumeric printout. This approximation may be
somewhat optimistic.
Figure F.5 shows that peak CPU loads on the support processor
occur during second and third shift. Note that the case shown in
Figure F.4 has the main load during prime shift. The difference
is the COM formatting, which peaks in Cases 6 and 7 (see Task 10C
in Table F.1). The loading shown in Figure F.5 indicates
additional processor time available during the prime shift (8 am EST to
5 pm PST).
One variation of the scenario was tested to see the impact on
support processor loading. The variation was to change the
fraction of editing done on the support processor from 0.2 to 0.8.
The increase in editing load brings the CPU loading from .315 to
.317 CPU hours per hour (average over the shift).
F.4.2 File System Activity
In order to begin to evaluate the file system in detail, the major
functional demands were determined, based upon the previously
described scenarios. Two types of demands were considered: data
transfer and control.
Data transfer demands were considered based on each of the
interconnection paths in Figure F.1. The data transfer rates, averaged
by day and by shift (as defined by Figure F.3), are shown in Table
F.5. Here again, the rates are averaged either over the day or over a
shift and do not consider peak loading. It is interesting to note
that if the Support Processor is relieved of the COM formatting
task, the rates for Support Processor - File System
(corresponding to the first line of Table F.5) become 7.711K char/sec averaged
over the day, and the three shift rates become 56.90, 15.38K, and
44.5 char/sec respectively, averaged uniformly over each shift.
Again, the major reduction is during non-prime time.
Control functions were considered with respect to file activity.
Based on the NASA-supplied scenarios, the number of file
creations, file deletions, and file accesses per day were
determined for the active high-speed access files, for the
long-term, somewhat slower access files, and for the slow access archive
files. In addition, the number of times that an active file's
contents were replaced by new contents was determined. These
results are shown in Table F.6.
TABLE F.5
NASF DATA TRANSFER REQUIREMENTS
(with COM)
RATE (Char/Sec)

                                     Day              Hourly Average
                                   Average    12M-3am     5am-5pm     5pm-12M
Support Processor - File System     29,240    83.388K     16.678K     35.937K
Support Processor - FMP               .050       .02         .08         .02
Support Processor - Users            4,453      .228K      8.125K       .187K
File System - Users                 24,260     3.002K      45.9K       1.554K
File System - FMP                  163,400   294.770K    210.032K     73.770K
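The relation between the shift averages and the day average in Table F.5 can be checked with a short sketch. The shift lengths of 3, 12, and 7 hours are read from the column headings (12M-3am, 5am-5pm, 5pm-12M), and dividing by a full 24-hour day is an assumption that appears to reproduce the tabulated day averages to within rounding. The function name is hypothetical.

```python
def day_average(shift_rates, hours_in_day=24.0):
    """Time-weighted day average from (hours, chars/sec) pairs per shift."""
    return sum(hours * rate for hours, rate in shift_rates) / hours_in_day


# Support Processor - File System path, with COM (first line of Table F.5).
with_com = [(3, 83388.0), (12, 16678.0), (7, 35937.0)]

# Same path with the COM formatting task removed (rates quoted in the text).
without_com = [(3, 56.90), (12, 15380.0), (7, 44.5)]
```

With these inputs the computed averages are close to the 29,240 and 7.711K char/sec figures given for this path, which supports reading the three shifts as covering the 22 operating hours.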
TABLE F.6
NASF FILE SYSTEM CONTROL ACTIVITY PER DAY

                                 FILE TYPE
FILE ACTIVITY        ACTIVE     LONG-TERM     ARCHIVE
Files Created         2483        1127         627.3
Files Deleted         2483        1127         627.3
Files Accessed       19810         827.7       118.3
Files Replaced        1302
F.5 FUTURE WORK
In order to make the system analysis more accurate, benchmarks
must be developed which represent the work to be supported by the
Support Processor. The magnitude of editing, compiling, and
linking can be determined given the existing codes and given the
assumption that the compilers and linkers developed to support the
FMP would be of the same complexity as that of the existing codes
on existing machines. Benchmarks can be developed to study the
actual formatting rate and the SPS commitment to task management
and I/O. More accurate estimates of the grid generation task and
the interactive graphics support tasks need to be made.
All system-level modelling must be operationally based. That is,
the results of any system-level modelling should be easily
verified by direct observation of an actual system. If this
guideline is followed, verification of the models will become
straightforward.
REFERENCES
1. Final Report, NASF Preliminary Study, Contract NAS2-9456,
Burroughs, October 1977.
2. Final Report, NASF Preliminary Study Extension, Contract
NAS2-9456, Burroughs, February 1978.
3. NASF Utilization, October 1978 Draft, NASA Ames.
APPENDIX G
FMP FORTRAN EXAMPLES with ORIGINAL FORTRAN SOURCE
The listings which follow in this appendix are examples of FMP
FORTRAN codes with their original FORTRAN source code. The FMP
FORTRAN versions are all mentioned in Appendix A with regard to
the analysis and simulation activities of the study. In some
cases (e.g. LINKHO, COMP2 of the GISS weather codes, and the BTRI
of the Implicit Aero Code), the resulting FMP FORTRAN is
incomplete. These cases were taken only far enough to be able to
generate an accurate timing since a functional simulation was not
required at this time.
The listings provided are identified in the table below:
Figure    Application      Identification
G.1       Implicit Aero    SMOOTH
G.2       Implicit Aero    BTRI
G.3       Explicit Aero    OUTER
G.4       Explicit Aero    TURBDA
G.5       Explicit Aero    LX
G.6       GISS Weather     COMP2 Section
G.7       GISS Weather     LINKHO (part of COMP3)
Original FORTRAN
Figure G.1 Implicit Aero - SMOOTH
REPRODUCIBILITY OF THE ORIGINAL PAGE IS POOR
FMP FORTRAN
SUBROUTINE SMOOTH
CDtIMDH/BRSE/NHRX_JHRX_KHRX_LHRX_DT_6RHHR_GRHI_FSMRCH_
DMi_DYI_DZ_FV<5)_FD<5)_HD_RLP_O_aHE_R_HDXgHDY_HDZ_RH_
Z CNBR_PZ_ZTR_NP_ZNT_ZNT_NT_
DDHRXN /HDDEL/_ J=Z_Z00; M=¢fS0; L=_00
RE_ZON /THREED((J=_HRX-Z)_(K=_KHRX-_)_(L=_LHRX-Z))/
x = /HODEL(J_K_L)/
ZNRLL /HflDEL/ e(_)_(5)_SS_CT(5)_TZ_TZ_T_T_
WTH ORDER SMflBTNZN_ _D ORDER RT THE BDUNDRRIKS
DORLL /THREED(O_M_L)/ ; U_NG _ S_ _HU
TEMP = £,/e(JgKgL_6)
Oa i N=_5
CONTINUE
ZF (J,E_,_ ,DR, J.E@,JHRX-A) THEN
Tl = e(J+l_K_L_6)
e(O-I_K_L_ND_TZ)_TEHP
CONTINUE
ELSE
DO 3 N=I_5
TB:e(J_Z_K_L_6>
l _,X<e<J_Z_K)L_N)_TB * e(J-a_M_L_N_XTW) - 6,_CT(N))_TEHP
CQt|TZNUE
END_F
NEXTDD
ZF (K.Ee,_ ._R. K.Ee._HRX-£) THEN
DO _ N=£_5
SS = SS + u,SxSHU_<G<J_M*L_L_N)_TZ + _(J_K-I_L_N/xT_ -
Z _.xCT(N))XTEHP
CflNTZNUE
Figure G.1 Implicit Aero - SMOOTH (Cont'd)
Original FORTRAN
C"
C
_0 C_I4TINUE SHDDTH_
$HDDTN_
SMDBTHH5
OD B0 _ = _Jt| SHDOTHW6
DO 30 K = _KH SMDDTH_7
DO _2 L = _LHH _HDDTH_8
KL = _L-_XND*K _HDDTHq9
;1 = KLcND SMOOTHS0
XZ = KL_xND SHBDTH51
Z3 = _L-ND SMOOTH52
_ = KL-_xND _HDDTNSB
DD _ N = 1_5 _HDDTHS_
32 S(KL_Nsj # = S<KLsl4_J)-SHU_<e<Z_sl4_d)xe(Z2_$_j).e<Xq_14sJ)x SI'|ODTH55
£ G_I_f6_J)_6,xe(KL_N_J)xe_KL_U_-q,_e{Zl_N_×e_I1_6,JJ-_,_ _HDOTH56
Q{I_HtJ)x@(IJ_6_J)>/_{KL_GsJ_ SMOOTH57
DO _0 N = 1,5 _HDDTH56
KL = ND+K SMOOTH59
;1 = KL_ND SMOOTHS0
_Z = EL-NO _NDBTH61
RETURN
END
e(KL_$_J#_e(Z2_NsJ_Xe_2_6sJ))/e<KL_$_U) _HBDTN63
KL = LHHxND+K $HDDTH6W
I1 = NL_ND SMOOTH65
_2 = EL-NO SMOOTH66
$(KL_N_U) = $(KL_N_Jj_$HZ_(e(ZZ_N_JJ_e(II_JJ-_,_e<KL_I4,J)xSHDDTHb7
Q(KL_6_Jj_G(I2_N_JjXe(I_6,J))/G<KL_6_J) SMOOTH58
CONTINUE $MDDTH_9
SHDDTNT0
SHDDTH71
Figure G.1 Implicit Aero - SMOOTH (Cont'd)
FMP FORTRAN
ELSE
DO 5 N=195
1 _,_(e(J_K*I_L_N;xT3 + e(J_k-i_L_N)XTq) - 6,XCT_N))_TEHP
CONTINUE
ENDZF
NEXTDQ
IF (LoEe. E .DR. L.Ee,LHRX-I) THEM
TI=Q(J_K_L*I_6)
O0 6 N=1_5
CONTINUE
ELSE
T3 = e(J_K_L.I_6)
Tq = G(J_K_L-i_6)
DB ? N=I_5
i _,Xe(J_K_L_I_N)XT3 + 4.xe(JrK_L-i_Iq)_TW
g - 6,XCTCN))RTEMP
CONTINUE
ENDIF
ENDDD /7HREED/ | _IVIN_ $
RETURN
END
Figure G.1 Implicit Aero - SMOOTH (Cont'd)
Original FORTRAN
(VIOOOOI5450SRH)IMPLICIT/BYRI DN NS$
_UBRDUTIN£ BTRI(ILR_IUR)
CDHH_N/BTRID#R(60_S_S)IB(60_5_5)_C(60_S_5)_D_60_5_5)_V(60_5)
DIMENSION H_5_5)
REflL LI¢_L_I_L_L3Z_L3_L33_L4_L4g_L43_Lq4_L51_L5_L53_L54_L55
ZL=ILR
;U=IUR
.I_:IL+I
ZE=ZU-&
C ZN_ERT LUDEC
UI4 = B(IL_Z_4)xLZ1
L4Z=B(IL_4_£)
Lq_=B(IL_4_Z)-L4_xUZ_
Lq3=B_IL_4_3)-LqzAu£_-Lq_xU_3
U34=<B<¢L_)-L31_UZ4-L32_U_4)XL33
Lqq=l,/(B(IL_q_-U14xL4Z-U_q_Lq_-u34_L43)
U35=(B(ZL_5)-L_lRuzS-L32XU_5)_L33
LSI=B(IL_5_I)
L5_=B(IL_5_)-LSZxuz_
L53=B(IL_5_3)-L51XUI3-LS_XU23
L54=B(IL_5_4)-LSIXUI4-LS_RU24-L53_U34
U45=(B(IL_q$5>-L41_UI5-L4_XU_5-L4_XU35)_L44
L55=l,/(_(IL_5_5>-L51_uIS-LS_U_5-L53_U35-LSq_u_5)
C C_HPUTE LITTLE R
DI=LII_F(_L_I_
D3=L33_(F(¢L_3;-L31_OZ-L3_XDZ)
D_=L44X_F<IL_4)-L41_D1-L4_XO2-L43XD3)
DS=L55X(F(IL_5)-LSZ_DZ-LS_AD_-L53XD3-L54_D4)
C CDtIPUTK BIG R 5
F_IL_5)=D5
F<ZL_W}=D4-U_SxD5
F(_L_3_=D_-U34xF(IL_4)-U35XD5
F(IL_I)=DI-Ui_xF(IL_-UID_F(IL_)-UlqXF<IL_4)-UI5_D5
Figure G.2 Implicit Aero - BTRI
FMP FORTRAN
SUBRDUTINE BTRI(IUR_
RSSUHE STRRTZNG INDEX = i
COHHBN /BTRZD/ A(IUR_5_5)_ B(IUR_5_5)_ C(ZUR_5_5)_
1 D(IUR_5_5)_ F(IUR_5_
DIHEN_IDN H_5_5)
ZHPL_CZT RERL_L_
;NSERT LUDEC <_IMPLIFIED FOR DIRGBNRL INPUT RRRRY B) FOR I=l
Lil = _.1S(I,191)
L33 = 1.tB(ls)_3)
L44 = 1./_(1_4_47
L55 = z./B(Z_5_5)
COHPUTE LITTLE Rl_ DHiTTED, THESE TEHPDRRRiE_ NDT NEEDED
THI_ PR_, CDtlPUTE Bi_ RI$
F(¢,5) = L55
F(I_4) = L44
F<I_3) = L33
_<I,E) = Lga
_(I_i) = LI1
Figure G.2 Implicit Aero - BTRI (Cont'd)
Original FORTRAN
COMPUTE C PRIME FOR FIRST RON
DI=LI£AC(IL_£_H_
Dq=Lq4_(C(IL_q_H)-Lq_DZ-LH_XO_-Lq3xDB)
D5=L55X(C(IL_5_H/-LS£_D1-LS_xD_-L53XDB-L54_D4)
_(IL_5_H)=05
B(IL_g_Id)=D4-U45xD5
B(IL93_H/ = D_-U34xB(IL_4_H/-U55_D5
B(XL_Z_H) = DZ-t,_XB(XL_3_H)-U24xB(IL_q_H_-U_SXD5
B(_L_H) = DZ-U3_XB(IL_II)-U_3_B(IL_H)-U14_B_IL_4_H)-UIS_D5
00 i_ I=IS_IE
CD|IPUTE B PRIHK_BI_R
DO £4 N=I_5
F(I,N)=F(I_N/-R(I,N_I)_F(I-I_a#-8(I_N_>xF(I-£_2>-R(I,N_)_F(I-I_
x)-A_I_I4_g)xF(I-Zf4)-R(I_I4s53AF(Z-£_5)
CDHPUTE _ PRIME
DO 11 N=I_5
II4SERT LUOEC 8_N
UIZ=H(£_)XL/Z
UZ3=H(I_3)_LII
UI4=H(I_4)XL££
U_5=H(/_5)_Lil
L_i=H(3_i)
L_=Hf3,a)-LB£*UZa
LBB=i./(H_B)-UI_XL3I-U_3XL3_)
L41=H(4_Z_
LW_=H<q_)-LWixul_
Lq_=H(4_3)-L4_AU_B-L4EXU_3
L44=i./(H(q_q}-u_4xLqz-u_4_Lq_-u34_L43)
U_5=(H(B_5)-L3£XuZS-L_XU_5)XLB3
U5£=H<5_£)
LS_=H(5,E)-LSZ*uI_
Figure G.2 Implicit Aero - BTRI (Cont'd)
FMP FORTRAN
CDHPUTE C PRIHE FOR FIRST RUN
C HRS BEEN ELIHINRTEB R_ R $IHPLE
RE_UBSCRIPTING OF THE D RRRRY
B(195_11) = L55 X C(I_5_l.|;
B(i_4_H; = L44 x C(I_4_H)
B(I_3_11) = U_3 X C(IsS_H)
CONTINUE
HERE NON STRRT$ THE HRIN LOOP _F BTRI
DO 13 I = _,IUR
CDHPUTE B PRIHE _ BIGR
COttPUTE B PRIME
DO ££ N = I_5
00 1£ H = £s5
R(I_N_) X B(I-£,_H) - R(Isl4_C) X
iN_ERT LUDEC R6RIN
HERE SHRLL BE INSERTED R COPY OF THE FORHER LUDEC,
EXRCTLY R_ _N_NN _N THE IMPLICIT CODE COHPILRTION BY $CHREFFER
Figure G.2 Implicit Aero - BTRI (Cont'd)
Original FORTRAN
U45=_H_W,5)-Lqz_u_5-LW_U_5-L43_U35)_LWW
COHPUTE LITTLE R°'_
D_=LWW_tY(I,W_-LqZ_DZ-L_D_-L_3_D3)
CDIIPUTE _IG R"_
_',ii,5)=D5
F_I,q_=D½-U45_D5
C_I1PUTE C FRI|IE_
b_ _5 H=Z_5
DS=L55_C(I,_Its-LSZ_Di-LS_D_-LS_D_-LS_Dq)
_(I_5_II_=D5
_I,4,H_=O4-U45xO5
_itifl4) = Di-Ui_(It_|I)-UZ_xB<I,3,HJ-UiHx_(I,qII4)-U£5_D5
CDMTINUE
COHPUTE _ PRIHE_BIG R FOR L_T ROH
x FtI-i,_)-_I_N,qJ_F(l-i,4/-_ti,N,5)_F(I-i_5_
COI1PUTE _ PRIME
DO i_ N=i_5
INSERT LUDEC RGRIN
Figure G.2 Implicit Aero - BTRI (Cont'd)
FMP FORTRAN
CONT Z NU',:"
THI_ I_ THE END QF THE HRIN ." LDDP_
COIIPUTE L;TTL£ R'_
D3 = L53 _ _'r_Z_3) - LSZ x DZ - L_ × D_>
D4 = L4_ _ 6F(X_W) - L4i _ D1 - L4_ _ D_ - L43 K D_;
05 = L55X(F(I95) - LDIXD£ - LS_XD_ - L55_D5 - L54XD_)
C_IIPUTK BX_ RmS
r_Z,5) = 05
v<x,4) = 04 -u45xo5
r_Z_>) = O_ - U_4xF(l,4_ - uDSxo5
F_19i) = D1 - U_KF(I,_) - Ui_F(X,S, * UiW_F(|9_) - UIS*D5
;F <_ .LT, _UR) THEN
DO £5 tl = ¢,5
D£ = LI£_C(I,i,II)
D4 = Lttq_<C(I,q_lt._ - LWCxDi - L4_XD_ - LH3AD3)
D5 = L55X(C(Z95,H) - L51_0£ - LS_xOg - L55xD_ - LS_xD_
8(Z_gH) _ D5
B<I_I4_ : Dq - UqSXD5
B<I9_14) : D3 - U_4xC(I_W,H_ - U_5XD5
B(I_Z_H) = O_ - U_3XB_I_3_H) - U_4XB(I,q_II) - U_SxD5
8kI9£9|4) = D£ - UI_B(Ig_gH_ - UI_B(X,_It_ - UI4XB(I,H_H;
- uzS_o5
CBNTXNUE
IIqCLUDIN_ I=IUR
Figure G.2 Implicit Aero - BTRI (Cont'd)
Original FORTRAN
U&5=H(£,5)ALI£
LbL=H<5,L)
L3E=H<5,a)-L31^UZa
U_5=t'Hk_J-L_LAUZS)_LL_
LqZ=H_W,L/
L45=HfW,_)-LH£_UZ3-LHE%U_
U_4=(H_5,H)-LScAU_4-L_xu_qJ_L53
L44=Z,/<H(4_4s-U_HXL4_-,U_H_LH_-u3q_L_5)
U55=IH_5,5)-L_IAUlS-L3_XU_5)_L55
LS_=H(5_Z)-LS£AU_
L55=H_5,_)-LSlAUZ_-LS_U_
LSH=H_5,q)-LSZ_U,_4-LSZ_u_W-L53*U34
U45=<H(4_5/-L4Z_U_5-L4_RU_5-LqSAU_5)_L44
L55=Z,/(HkS,5)-LSZ_UIS-LS_u_5-LS_Au55-LSq_u45)
CDItPUTE LITTLE R"5
D_=L££RF(I,Z)
DS=L55X(F(I_5)-LSZ_Dl-LSExD_-LS_AD_-L54_D4)
CDItPUTE SIS R"S
F,:I_5)=D5
F(I_41=D4-U45_D5
F(I,_)=D3-U3_XF(I_A-U55xD5
F(_£}=Da-U_F(I,5)-UZq_F(I,q)-U£5_D5
F<I_Z)=DZ-U_*FCIta)-U1BXF(I,3)-U£W*F(I_q)-u£SAD5
I=IU
D_ £9 N=I_5
iF <I,GT,IL;6DTO_0
RETURN
END
Figure G.2 Implicit Aero - BTRI (Cont'd)
FMP FORTRAN
t4QTE THE HE_RTZUE CODE XNCREHKHTS ZN THE NEXT SECTZDN
DD _9 N=Z_5
CQNTINUE
RETURN
END
Figure G.2 Implicit Aero - BTRI (Cont'd)
Original FORTRAN
_UBROUTINE OUTER_J_JE_K_,KE/
_x _UTER _OUNDRR_ COt_O_TI_N_ xx
RZ
R3
A4
COIIHOII/RZI/ RNO(_l_31_31)_RHDU(_I_31_/)_RHOV(_I_/_31_
COItHDN/RZ_/ RHDN_31_31_3£>_E_D_,31_I)_EI<_I_£_31)
COHHDN/_13/ U_3_Dl_31/_V_31_31_31),N(3£,31_31)
CDtftfON/R3/ "(_'_I/_D'_CELL_D£},USZ_UEI_U_I_JEI_ULFH_JL_'EF_'(H
_Z(DI),DZCELL_£_S£SKEI_K_KE_KLFH_KL_ZF_ZH
CDHHDN/R_/ LSHK,ILE_Z_,XL_R¢_K_K_K_K5
DOHNSTRERH RT I=IL
DO i K=K_KE
DD _ J=J_JE
RHO _IL_J_=_HB _IE_J_K_
RHOUkIL_J_K_=RHOUkZE_J_K_
RHDU{ZL_U,_=RHOV_E_J_:)
RHDN(ZL_J_K/=RH_N_ZE,J_K2
CONTINUE
ZF(JE,LT.JE_ GO TD
UPPER _o C. _T J=JL
DO _ K=k_,KE
DO _ Z=_ iE
RHD (I_JL_K)=RHO _I_JE_K}
RHDU<Z_JL,f_=RHDUkZ_JE_K)
RHDU(I_L_K_=RHDV_Z_JE_R)
_HOH_I_JL_K/=RHON_Z_JE_,R}
CONTINUE
CONTINUE
IF_KEoLT,KE_ RETURN
EOGE S. C, _T E=KL
DD _ J=J_JE
DD _ I=_, IE
RHO (;_J_KL)=EHg fIfJ_KE_)
RHDU_I_J_KL_=RHOU(Z_J_KE_)
RHDU(I_J_kL_=RHOV_Z_J_KE_)
RHDNkI_J_KL_=RHDN(I_J,KE_)
CDHTINUE
RETURN
EHD
Figure G.3 Explicit Aero - OUTER
FMP FORTRAN
_UBR_UTZNE QUTER<JS_UE_K_KE)
CDHN_N/RZ&/ RHD_ZU0_00_&00)_RHDU_Z00_I00_£U0)_RHDU(100_I00_I00_
CDttHDN/RZ_/ RHDH(IOO,&uO_£OO)_K<ZOO_ZOO_ZOO)_EI(IO0_IO0_£O0)
CDtlHDN/RI_/ U_ZOO_zOO_ZOO)_U(ZOO_ZOO_ZUO)_N(IO0_IO0_IO0_
CDtIHDN/R3/ Y(ZOO)_DYCELL_LOO)sJ_I_JE&_J_UK_sOLFHsJL_'(F_YH
_Z(100)_DZCELL_Z00),kS_KK_sKSZsKE_KLFMsKL_ZF_ZH
CDHH_N/R_/ X_HK_LE_ZE_IL_KZ_K_K_Kq_K5
O_NN_TRERH RT _=_L
DDRLL K=KS_KE;J=J$_JE ; U_XN_/RI1/_/Rl_/_IR_/_/Rq/
RH_(ZL_J_K) = RHD(ZE_J_K)
RHBU_ZL_J_K) = RNDU(ZE_J_K_
RHDU(ZL_O_K/ = RHDU_IE_J_K)
RHDN_ZL_J_K) = RHDN(ZEfU_K)
£_Z_J_K) = E(ZE_JfK/
IF <JE.LT.3£_ G_ TO 3
UPPER B. C. RT J=JL
RHD(Z_JK_K) = RHD<Z_JE_K)
RHDUkt_JK_K) = RHDU_Z_JE_K)
RHDU(Z_JK_K_ = RHDV(I_ JE_K)
RHDN(I_JK_K} = RHDH(Z_UE_K)
£_I_JKtK) = £(Z_JE_K)
ENDDD| GIUIN_ /RZI/_/RZ_/
ZF _K.GE.KE_) THEN
EOGE B.C. RT K=KL
DDRLL J=JS_dE_I=_fIE ; USIN_
RRDU(Z_J_KL_ = RHDU_Z,J_KE_)
RHDU(I_J_KL/ = RH_U_I_J_KE_)
RHDN_I_J_KL_ = RHDN(X_JfK£_)
ENDDD; _IUIN_ /RI£/_/RZ_/
£ND_F
RETURN
END
Figure G.3 Explicit Aero - OUTER (Cont'd)
Original FORTRAN
_UBRBUT[NE 7URBDR
R3
R4
R5
e6
COHHBN/R_/ RHD<31_31_31))RHnU(31_31_31)_RHDU(31_5_)31)
CDHHO_IRI_/ RHON<31_31)31))K(31)_I)31))KI<_£)31)31)
CUHHDN/RI3/ U(31)3i)31))V(31)31)31))_<31_31)31>
C_HHUN/R14/ F(_)5)
COHHSN/R_/ PRDICT(3_)5)_P(Ba)
COtIHON/R3/ 7(31))DYCELL(_I))JSI_JEI_J_)OE_fJL_H)dL_YF)YH
1 _Z_3_))OZCKLL(31)_K_I)KE1)K_KE_KLFHsKL_ZF_ZH
C_|IH_N/R4/ ISHK_ILE)IE_IL_KI_K_K3)K4)K5
C_HHDN/_5/ 6RHMR)_RMH1)_RMMPR)CU_CUI,STDKES_UU)C0)PD)RHDU)RL)XU
C_HHDt_/R6/ RHUL(31)31_31)
CVI=I,/CU
Oa i K=_)KL
D_ 1 J=i)JL
DD i _=I_IL
7£HP=RB$(KI(I_J)K;)_CUi
_F<K,Ee,I) TEHP=,SXRB_(EI(I_J)i)+EI(I_J)_))xCUI
IF(J,Ee.I) TEHP=,5_RBS(EI(I_I_K)+EI<I)_K;)_CUI
RHUL<Z_J_=_,_70£-OS_eRT(TEHPR*3)/(TEMP+I98,6)
i C_NT_NUE
PETURN
END
Figure G.4 Explicit Aero - TURBDA
FMP FORTRAN
SUBROUTINE TURBDR<CU)
CDHNON/Rll/RHO(i00_100_I00)_RHDU(100tI00_100)_
I RHOU(£O0_IO0_IO0)
COHHON/R_/ RHON(100_I00_I00)_E(£009100_i00)_EZ(100_100_I00)
CDHHON/R131.u(lOO_IOO_IOO)_u(IOO_£OO_lOO)_N(100_£O0_lO0)
COHHDN/RIH/ F(_95)
CDHHDN/R_/ PROICT(101,5)_P(101)
CDHHDN/R3/ Y(100)_OYCELL(Z00)_JSI_JEI_J_JE_JLFH_JL_YF_YH
i _z(IOO)_DZCELL_ZOO)_KSZ_KEI_KS_KE_KLFH_KL_ZF_ZH
COHHDN/RH/ I_HK_ZLE_ZE_Z_Z_K_K_K5
CDHHDN/_5/_RHHR_RHHI_RHHPR) EU_CUI_TDRES_U0_C0_P0_RHD0_RL_X0
COHH_N/R6/ RHUL(IO0_£O0_IO0)
DOHRIN /EXPLCT/|I=I_Z00|J=Z_I00_K=I_I00
ZNRLL/EXPLCT/ TKHP
Cvl = 1,0/CU
DDRLL J=JSI_JE_K=K_I_KEZ; USING /RI_/_/RS/
DO i I=I_IL
ZF (K,Ee,1) TEHP=0,5_RBS(EZ(ItJ_/_EI(I_J_))XCUI
ELSE IF(J.Ee.1)TEHP=0.5_RB_(EI(I_I_K)+EI(Z_K))XCUI
ELSE TEHP=RBS(EI(Z_J_K})_CU1
ENOZF
RHUL(Z_J_K) = _._70E-0_SeRT(TEHP_X_)/TEHP%Ig_,6)
CONTINUE
ENDDO; 6ZUZN_ /R6/
RETURN
END
Figure G.4 Explicit Aero - TURBDA (Cont'd)
Original FORTRAN
_UBROUTINE LX
LX DPKRRTOR
81
8a
83
A4
A5
87
COHHD|4/RII/ RHD(31_31)31))RHQU(31)31)31))RHDV431_I)31)
C_IIHDN/R12/ RHUN(31)31)_I))E(31,31_31)_EI(31S31)51)
CDIIHUN.'RI3/ O_3_31)31)_V_31)31_31))H(31_31_31)
CUHHDt4/R_4/ F(2_5)
CDI4)tON/R_/ PRDICT<32, 5> _ P(32)
CDHHDN/B3/ Y_BI))DYCKLL_I))J$1_JE1)J_)JK2sJLFH_JL_'_F_'(H
_Z_31>)DZCELLk_I>)KSlsKElskS_,KK2)KLFH_KLs=FIZH
CD|IHDN/R_/ ZSHK)ZLE)ZE_IL)KI)K_K3)K4)K5
COtlHgN/RS/ GRHHR)GRHH1)_RHMPR_CU)CVI_TDKE_)UU)C0_P0)_HD0_RL_XU
CDHMDN/R7/ DXsDXZsDY_DYZ)DZ_ DZI)EINRLL)Z_DBHL _DT)CFL_CDII_T
DTDX=DT_DXI
DO 1 K=KSl)KK_
DO _ J=J_i)4E_
DD 3 I:I)IL
PRDICT(I)I)uRHB %I,J)K)
PRDICT(I)_)=RH_U<I,J)K)
PRDXCT(Z_3>=RH_U(I_J_K2
PRDICT(I_q)=RHBH(I)J)K)
PRDICT(I)5)=K (I)J)K}
P<I)=_RHH1 _RHD(I_,K)XEI<I)J)K)
3 CONTINUE
_=1
IROOut_-i
HHI=N-I
B=I./H
II=I+IRDD
U£I=U_II)J)K_
CRLL FX_UZI)Z_J)K)_)
DO 5 _=2_ZE
K3=KI _KZ=K_ _K_=K3
ZI=I+IRDD
Figure G.5 Explicit Aero - LX
FMP FORTRAN
_UBRDUTTHg LX
L_ QPERRTDR
CDIIHQN/R£1/ RHD_Z00,Z00_I00)_RHDU(1009£U0_£00)_RHBU(100_100_£00)
CDIIHQN/RI_/ RHDH(zOO_IOO_ZOO)_K(ZOO_ZOO_£OO)_EZ(100_100_100)
CDIIHDN/R13t U(100_£00_100)_9(100_£00,100)_N(100_100,100)
COHHDN/R£_/ F(_5)
CDII|tDN/R_/ PRDZCT(£Oi_5)_P(101)
CDtlHDN/R3/ Y(IOO)_DYC_LL(100)_J_Z_UgI_J_JE_ULFI4_UL_F_yH
£ _Z(iOO)_DZCELL_£OO)_$£_KEi_K$Z_KE_KLFH_L_ZF_ZH
CDIIHDN/Rq/ ZSHK,_LE_ZE_ZL_KL_K_K_Kq_K5
COHHQN/R5/ _RHHR_6RHH&_RHHPR_CU_CVZ_=TDAE_U0_C0_P0eRHD0_RL_XU
CDHHDN/RT/ DX_DXZ_DY_DYZ_DZ_ DZZ_E=NRLL_ZROBNL _DT_CFL_CDNST
DDHRZN /gXPLCT/:Z=l_ZU0|J=Z_ZOO_K=Z_100
DTDX=DT_DX£
DD 3 Z=I_ZL
PRDZCT(_=RHDU(Z_J_K)
PRDZCT(Z_3)=RHDU(Z_J_K)
PRDICT(I_)=RHgN(I_U_K)
PRDICT(Z_5)=E (_J_K)
P(I)=_RI1H£ XRHD(_J_K)_EI(I_J_K_
3 CONTZNUE
DD q N=I_
IRDD=N-i
NH&=N-£
B=£,/N
UIZ=U_II_J_K)
CRLL FX(UIZ_Z_J_%Z)
Dg 5 I=_ZE
KS=KZ
KZ=K_
Z_=Z+ZRDD
Figure G.5 Explicit Aero - LX (Cont'd)
Original FORTRAN
U,_ I=U_ Z I ! JgK)
UZZ=U( Z+l._ d_ K)
UZP=U<X,,J_K/
;F (U i1, _T, UI2, RNO, _3, XU I-I--UI ;") )_ (5,)_UI_-U I1), LT, 0, ) U I I :, 5_< (UI J.+U I;:'
× )
ERLL FX(UZI_ZgJ_Kg_I)
PRO I CT ( ': _ J.)= (NHJ.XPRD ICT( i _ _L) +RH M (X_J_KJ-DTDX_.(F(KF'9.L)-F(K.L_J.,)))_B
PRDZCT( ; _ _)= (NH.LxPRD XCT( ! 9 _-.) +RHDU( Z ,j J9 }'_)-DTDXX (r (Kts _)-F_ k£ _ 2) ) )'_B
PRD Z CT ( _ _ 3 )= _'NH ; xPRD ICT( Z _ 5)+RHBU( Z _ J, K)-DTDX>_(F( KE 9 _)-F (KZ_ 3) ) )*B
PRDICT ( I 9 W)=iNH', _PRDICT ( ; _ q )+RH_M< I 9 J9 K)'DTDX_(F (K_ 9 W)-F (KI9 W) ) )_B
PRD I CT ( = _ ..%> = ( NH" >_PRD Z CT ( I , 5) "PE ( Z 9 J9 K >-DTDX_(F<Ka_ 5)-F(MI_ 5_ ))>_B
CONTXNUE
DECODE x
DO 6 I=_IE
RHBZ=I./PRDICT(Zgl)
U (IgJgK)=PRD;CT(_,2)_RHQZ
U _ZgJgkJ=PRDZCT(Z93)_RH_Z
H (IgJgK)=PRDICT(I_q)ARHgZ
EZ(I_J_K)=PRDICT[I_5)X RHBZ -oSX(U(Z_J_K/_X_*U<I_J_V.)XX_+N<I
x _JgK)XX_)
P<X) =_RHHI_PRDZCT(_/)AEZ(I_J,K)
C_|ITZNUE
_ODNNSTRERH B, C, RT I=XL
DO 9 K6=195
PRDICT(IL_k6)=PRD_CT(IEgK_)
CREL BCYkK_LEgJgJ)
C_TZNUE
RN_ (Z_J_K)=PRDZCT(_9_)
RH_U(Z_J_=PRDICT(I_>
RH_U(IgJgK)=PRDICT(I93)
RN_N(Z_J_K)=PRDICT(Ig_$
CBNTXNUK
C_NTXNUE
CQNTINUE
ER_L gUTER(JS_gJE_gKSZgK£_)
RETURN
END
Figure G.5 Explicit Aero - LX (Cont'd)
FMP FORTRAN
REPRODUCIBILITY OF THE
ORIGINAL PAGE IS POOR
UII=U(II_J_K)
UII=U_I+I_J_K
IF(UZZ,_T.UI_,RNDo (5,xUZI-UZ_)A(3,AUI_-UX1).'..T,0,,) UXI=,Sx(uz£+UI_
x )
CRLL FX(UII_X_,J_k_ XI)
PRDICT(I _ I>=(NHZ_PRDICT(I _ I)÷RH n _ I _ d _ K)-DTDX>_(F (K_ Z)-F(K£_ 1) ) ) _B
PRD I CT ( I g _ )-- (NHZRPRD I CT ( I _ '*) ÷RHnU ( Z _ d_ K )-DTDX_( ( F( KP_ _ -F (KJ. _ _> ) )RB
PRDZCT(If_3?=(NHlXPRDICT(I_3)+RHDV(IfJ_K)-DTDX>_(F(KP_3)-F(KI,j3_S>)A,._
PRO ICT ( I _ q)=(NHI_PRDZCT( I _ q _ +RHDW ( I _ ,J_ K )-DTDX>_ (F (K_ q)-F (K£ _ _ ) ) )XB
PRDZCT( I _ 5)=( NHlY<PRDZCT( X _ 5)÷E ( I _ d_ K)-DTDX_(F(K_ 5)-F(KI_ 5) ) )AB
CDNTINUE
A DECDDE A
DD 6 I=_IE
RH_I=I,/PRDZCT(I_I>
U (I_J_K)=PRDICT(I_)_RHDI
V (I_J_K)=PRDICT(I_)_RHDI
N (I_O_K)=PRDZCT(I_q)_RHDI
EZ(I_J_K)=PRDICT(I_5)A RHDI -.SX(U(;_J_K)XX_eV(I_J_K)xX_÷N(I
X _J_K_xX_)
P(I) =_RHHIRPRDICT(I_I)XEI(I_J_K)
C_NTINUE
xDONN_TRERH B° C. RT I=IL
DD 9 K6=1_5
PRDICT(IL_)=PRDICT(IE_K6_
CRLL BCY_K_IE_J_J#
CGNTINUE
DD 7 I=_IL
RHD (I_U_K}=PRDICT(I_£)
RHOU(I_U_K)=PRDICT(I_)
RHGV(I_U_K)=RRDICT(I_3)
RHDN(Z_d_K)=PRDZCT(I_)
E _Z_J_)=PRDZCT(Z_5)
CDNTINUE
ENDDD_ GIVIN6/R£1/s/R£Z/_/R£3/_/R_/
CRLL DUTER(JS£_JE_£_EE_)
RETURN
END
Figure G.5 Explicit Aero - LX (Cont'd)
Original FORTRAN
C_XXX
C_XXx
CxX_
KRPRPI=kRPR+I,
CDRZQLIS FDRCE
FXCD=.I_5_DT
DD 3L30 L=I_NLRY
IHi=IH
DO 3130 I=I_H
FD(19I)=U.
FD(Jt_I)=0,
DO 3110 d=_dHHl
3ii0 FD(J_Z)=F(J)XDXYP(J_+o_5A(U<J_IgL/YU_JgZHZ_L)TU(J_IgZgL}_
x UkJ_£gIH19L/)_4DXU_JA-DXU_J_l))
Og 31_0 J=_gJIt
RLPH=FXCBX(P(J_Z)+P(J-_,Z))_FD(JtZ;+FD(J-i_Z))
UT_J_;gL)=UT{Jg;gL)_RLPH_V(JgZgL)
UT<J_ZHiqL)=UT(J_ZHI_L)TRLPHxU(JgZHI_L)
UT(J_ZgL_=UT_J_Z,L)-RLPH×U4J_;gL}
3&Z0 UT<J_iHi_L)=UT(J_£H1,L}-RLPHXU_J_ZHi_L_
3£30 ;HI=I
Figure G.6 GISS Weather - Section of COMP2
FMP FORTRAN
C     THIS IS THE SECTION OF GISS - COMP2 THAT WAS SIMULATED
C     CORIOLIS FORCE
DDRLL J=_JHHZ_I=I_IH
IF (I.Ee.1) THEN
_Hi=ZM
ELSE
_N1 = I
EHDIF
VD(I_I) = O,O
FD(JH_Z) = O.O
DO _00 L=_NLRY
C     HERE THE COMMON SUBSCRIPT EXPRESSIONS ARE NOT GIVEN
C     BUT THE COMPILER IS ASSUMED TO HAVE EXTRACTED THEM APPROPRIATELY.
EQNT_NUK
ENDDD
D_RLL J=_JH|Z=I_IH
_F (I.E_._) THEN
_H£ = IM
ELSE
IHI = I
ENDIF
RLPH = FXC0 _ _p(J_Z)+p(J-£_E))_<FD<4_I)_FD<J-£_I))
UT{J_Z_L) = UT(J_Z_L) ? RLPH_V_J_I_L_
UT_J_IHZ,L) = UT(J_Zl4Z_L) _ RLPH_V<J_ZHI_L/
VT_J_I_L/ = UT<J_L_ - RLPHXUiJ_Z_L/
VT_J_ZHZ_L/ = VT(I_IN_L/ - RLPH_UkJ_NI_L)
CONTINUE
ENDD_
Figure G.6 GISS Weather - Section of COMP2 (Cont'd)
Original FORTRAN
C****
C**** VERTICAL ADVECTION OF THERMODYNAMIC ENERGY
DD 3_0 L=_qNLRYHI
LP£=L_I
DO 3180 I=Z,_H
DD 3180 J_lsJll
PL_=PTRDP÷SIG_LPI)_P(J_ Z)
PKi=EXP_Yk_PL£_
PR£=EXPBYKkPL_
CDZ=DSIG(LPI)/%D_X6_L)_D_ZG(LPI))
CD2=/,-C01
TKTRH=COI_T(J_L_IPKI+CD_XT(J_X_LP£)/P_
TT<J_I_L_=TT<J,;_L_DTA(_X_(L_KRPR_P(J_Z)XT(J_L)xPXT(J_)/PLZ
x -_D<J,Z,L_xTETRHAP_Z/D_XG(LI)
TT(J_Z,LP1)=TT_J_,LFI)_DT_SD(J_;_L)XTETRHXPK_/DSZ_(L)
ZF(LPI,£e,NLRY_ TTCU_;_LPI)=TT<U_Z_LP1)+DTXSZ_LPI)_KRPR_P<U_Z)X
X T<J_Z_LPZ)AP_T(J_I)/PL_
3_0 CDHTINUE
C_X_X
Figure G.6 GISS Weather - Section of COMP2 (Cont'd)
FMP FORTRAN
C     VERTICAL ADVECTION OF THERMODYNAMIC ENERGY
D_RLL J=£_JI4_I=£_IM
DD 300 L=Z_NLHYHZ
LPR = P(J_)
LSI6R = SI6(L)
L_IGB = _16(L*£)
PMI = KXPB'(K_PTRDP + L_IGRRLP8)
PM_ = £XPBYM<PTRDP + LSlGBALPR)
LDSI6R = DSlG(L)
LD_T68 = DSZ_(L÷I)
LTR = T(J_Z_L)
LTB = T(J_ ZgL_£)
LSDR = _D(J_Z_L)
LPITR = PZT(J_ Z)
col = LDSI6B/_LDSZ_R+LDSI6B)
C0_ = £,0 - C0Z
TETRH = C01_LTR/P_I + C0_LTB/PK_
LTTR = TT(J_;_L; ¢ DTA(L_ZfR_K_PR_LPRALTRALPITR/PL£-
£ LSDR_TETRH_P_Z/LD_I_R)
LTTB =TT(J_K_L_£) * DT_L_DR_TETRH_PK_/LD_IGR
ZF (LP£,E_,HLRY) LTTB=LTTB + DT_LSIGBRKRPR_LPR_LTE_
£ LPITR/PL_
TT(J_;_L) = LTTR
TT(J_;_L_I) = LTTB
CONTINUE
KNDDD
C     COMP2 CONTINUES BEYOND HERE. THIS IS THE END OF THE PIECE SIMULATED
Figure G.6 GISS Weather - Section of COMP2 (Cont'd)
Original FORTRAN
_LISROUT;IIE L_HhH_
CUI|IIDIL_RDCDIt/PL_gJ,I:LE_/tI'LF,_.,TGt73,TL_9_,V!T_5_,JHL_9_,
_AAA_
C_RID R_RR't_ ,_T_RRGE FFD_LEI4 014 _TR_
LUGICflL CLDFL_tsE_FLG,L_*LC
;HTEGER _TY
|_ERL _RRR,_EcEK,UII_TRU£,TRUT,fiR,E_,CC,TI4_@,TI_,
x 7RU,ED|4C|I_TDFCII,ED|IC|I,TRUCZR,PiU,E_TRU,T'{,_E_I,_EK_*t_ERR_
_ERU, _ERU, RE_C _ EX_, EA_, _EHO_ _HH_, D_Ht_, PEFUP _ FEFG||
^ REF(IZ),_DNCk&_)
EQU/URLENCE (FF:_RS(i,,EUP_I_,,Fk_RSEk_)iEUP_q_)
EQU2URLEHCE kR,EDHCIIJ_RRR_DFCI|i_E_OIIC|I/,
_6E_TRUCZh)_I_iF_Uli_TRU_E:,YRU)_TRUTIT'()_
C_X_CRLRR RRRR'_ _T_6LES OR U_ED FDR INITZRLIZRTIQ|I)
DZHEN_I_N EG(ZE_Z_)_FZZ_I_,I_TR(IC_I_)_FFI(_E)_FF_(¢E),
w TEHp_B),TE_(_OI)_D'.'_i_,FIRCRO_&_,I2,II_ERD(IZ),
C;REXT_I_),CC_BR(_),COEL_H(IZ_,COEK_)
COHH_N _EXI*T/ TE_
CxX_X
C_X _NGLK L_'dER COtIPUTRT_DII
C_XA EUP=UPNRRD£ EHI_ION
C_X EDN=DOHNNRRDS EHI_I0||
E_AX TOF=TRRNSHI_SIOt{
C_%_ _EF=RE .EEIlON
CXx_
HCC=I_CLOUD(tI_
t_HER=NRERO(t_
TRUCZR=NCLOUD_t;)A_C_KE><T(LRIl=_CTRU55_
TRUH_tI,K)=TRUII_N,E*_FflUC_R
Figure G.7 GISS Weather - LINKHO
FMP FORTRAN
      SUBROUTINE LINKHO
COllt4OH /RRDCDI4/PL_9)_PLE<ZU),PLKkgJ_TG,T_TL<9)_TSTR(_
_C_CD_Z_RSURF_SCO_Z_RRP_RRH
CDHHDN /CLDCDH/ _HRLE(16)_H_L(£5)_RL_Z_)_TAUL_£6_,OZRLE_16),
TOPRBS
LOG_CRL CLDFLG_RERFLG_LI_L_
_ERL TRUCIR_CTRU55,X_PIO_TN_RERZ_RER_RERR_RERC_RE_AERU_
1 KXI_KX_DKNUfDNHU_DNHI_RERV_EXTRU_TRU_RDNCN_EDHCH_TDFCH_
_UPCH_EDNCH
_HTEGER HCL_UD(_)_NRERO_I_
RERL CIREXT(I_)_TRUN(I_,_)_PICIR(I_)_FIZ(I_I_)_CSC£_£_)_
EDNC(/_)_TDFC<I_),RDIIC(I_)
C     ADDITIONAL DECLARATIONS NOT USED IN THE SIMULATED PORTION
C     ARE OMITTED FOR BREVITY.
C     STATEMENTS ABOUT PARALLELISM ARE OMITTED ALSO SINCE LINKHO
C     IS CALLED AS A SUBROUTINE WITHIN THE INSTANCES OF THE
C     DOALL /LAYERS/ OF COMPS. IN THIS CASE, EACH INSTANCE
C     CALLS LINKHO INDEPENDENT FROM ALL OTHER INSTANCES AND
C     USES A LOCAL COPY OF CODE WITHIN THE PROCESSOR IN WHICH
C     THE INSTANCE RESIDES. SEQUENCING OF THE EXECUTION WITHIN
C     THIS SUBROUTINE IS SOLELY DEPENDENT ON THE INSTANCE AND
C     LOCAL DATA, NOT ON ANY OTHER INSTANCES.
DO ZOO LAH = £_i_
o0 100 _ = z_
DD 101 N _ I_HLRYRS
|1CO = NCLDUD(N)
HAER = NAERO(H*
TRUCIR = CIREXT(LRH} X CTRU55 x NCC
= TAUd_N_K} • TAUCIR
TRUN(N_) = X
Figure G.7 GISS Weather - LINKHO (Cont'd)
Original FORTRAN
PiO=PIZEKD(N)K_
C IN CR_E CODE DIDN=T GD THROUGH PROPERTIES CRLCULRTIDH
C _ET FIB TD ZERO
;F(N,LE._ TII=TSTR_H>/_?3,
_F{N._K,_J TI¢=TL_N-d)/_73.
LF(TN,GE.,_5_WS,RND,NCLDUD_N>,GT,0)FIU=U,
IF(FIO,@T.Z,E-UWI @O TO _60
XF (UCC,_T,U) _0 TO 10_
C_A_UPNRRD RND DDHNNRRD FL_XEDN(N_ OF _INGLK<KUP(N))EDN_N2) RND COHPDE
1_ IF(X,LT,Z,E-0q) GO TO £03
/F(X,GT,ZS,_U) _0 TO _UH
EXTRU=EXP(XX)
C_CLERR LR_ER--PIU,LT0_0E-W
T'¢=_O.EOxX
;T'K=TY+Z,E0
TDF(H,=TE_ITYJe_TY-iT'{el)_(TE3_IT'(+I)-TE3<IT'{))
Go TO 105
£0q CONTINUE
EXTRU=U,
_05 REV(N)=0,
DFB=(ETDP(N_-_TOPkN+ZJ)_6,_67K-UI
F6RRD=DFB_(1,_-EXTRU*/X-TDF_N_)
RN_=I,U-TDF(I:
EDtlVN_=BTDP<N_¢]XRN_+F6RRD
EUP(N)=BTOP(N_RNS-F6RRD
_0 TO _09
• 03 TDF<N)=Z,O
REFCN)=O,
EUPkNp=U.EO
EDN(N_=U,EU
GO TO 109
_UPtN_=BTDP(N_
EDI(tN)=BTDP(N+Z)
_0 TO _09
Figure G.7 GISS Weather - LINKHO (Cont'd)
FMP FORTRAN
PIO =(TRUCIR_PICIRD(LflH) + PIZ(LRH_N))/_X+I.K-4V)
ZV(N,GE,4) THEN
TN = TL(N-3)/_73.
ELSE
TN = TSTR(N_/_?3,
KNDIF
iF (TN.GE.0.$53q8 .AND. NCC,GT.0)PID=0.
_F(PID.GT.X.E-4) THEN
AER1 = 1, - PID
RER2 = 1, - _PID_CB(LRM_N)
RERR = SCRT(RKR1/RER_
RKRU = (1. - RER_)/2.
_ERU = _1. ÷ AERn)/Z.
RERC = S_RT(3._flERI_RER2)
11 = -_RERC_X)
E×I = 0,0
IF (X1 .GE. -i80._18) EX1 = EXF(XZ)
IV <EXI.LT.i. UE-30) EXI=0.0
EX2 = ExixExi
DEN0 = Z,/((RERU_flERU_ - _RERU_flERUREX_))
DNHO = _<BTDP(N) - BTDP(N_I)/(XXAERC))X
((RERU - RERUXEX_) - <AERRAEX1))
DNHZ = RERU 9 RERU_KX2
EUP(N) = (BTDP(N)RDNHI - DNH0 - BTDP(N+I)_EX1)A
DENOxRKRR
EDN(N) = (eTDP(N+I)XDNH1 + DNH0 - BTOP(N)XExi)_
DENOxflERR
REV(N) = RERU_RERU_(I.-EX2)ADENU
TDF<N) = _RERU-RERU)_DEN0_EX1
Figure G.7 GISS Weather - LINKHO (Cont'd)
Original FORTRAN
C×_X_EMIS$IDN CRLCULRTIDNE FOR HRZE LR'(ER_EXRCT IN THE SEUP(N)SE DF ISD
C_XIC _CRTTERING
C_xXEXRCT _DLUTIDN=TNO-STRERH SOLUTIDNxFBR_E FRCTDR<PIU_TRuD)
¢60 RERR = S_RT((Z.-FI0)/_I,-PID_CE(LRM_N)))
RERU = (_-SERR)/2.U
RERU = _£+RERR)/Z,O
RERC = S_RT(RER2_,OxRER1)
EX_ = EXP(-RERC_TRUN_N_KJ)
EX2=EX&XEX&
Cg_AAXFDRGE FRCTOR FOR ISDTRDPIC SCRTTERING
FTNO=loEU
C_XX PIOZ=PIOxPIO
C_ FTNO=Z,U+O,iqXEXT_U+U.ixPIU2X<I,U-EXTRU_+(-i. O3+U.qOI9xPIO+O.6631X
C_X_IPIO2_XxEXTRU_(2,UiT_-o,6_O_P_O-Z,3597XRIO_)xXAXAEXTRU_EXTRU
DENO = (RERUx_2 - RERU_x_)_EX_
DNHU = _BTOP_N_-BTOP(Nel_)/TRUN<N_K)_/RERC_
1 <RERU-RERU_EX2-RERR_EXZ)
DNH_ = RERU m _ERUXEX_
EUP<N/=_BTDP_N)xDNH_-DNHO-BTOP_N_I)aEXI)/DENO_FTNO_RERR
EDt_N)=kSTOP(H_i)_DNHI÷DNHO-BTOP_N}_EX1)/OENOXFTNOXRERR
C_REF<N)_TDFkN) BRSED DN TAD STRERM SOLUTION
REF(N_=RERUxRERU_Z,0-EX_)/DENO
TD_(N)=_RERU-RERU)/DENO_EX1
C_x_ FORM TOP CDHPOSITE LRYER <RDDZTION)
I09 DENO=I,U-RDNCN_REF<N)
EUPCN=EUPCN_%EUP(N)+EDNCN_REF(N))_TDFC(N)/DENO
EDNCN=EON(N_*kEDNCN÷EUP<H)_RDNCN)_TDF(N_/DEND
_F_NCLOUD{N_,_ToU_ CLDFLG=,TRUE,
CAAXX _ET REROSDL FLRG IF CIRRUS CLOUDS KHIGH RLBEDO)
ZF<CLDFLG,RNO,P_U,GE°£,E-_ RERFLG=.TRUE,
C_XXX TRRNSMISSIDN CDt|PUTED DIFFERENTLY FOR 3 CREES
IF <CLDFLG,0R,RERFLG) GO TO 125
C_XX CRSE 1. RTHOSPHERE HRS NO REROSDLS DR CLOUDS THRU HERE
C_xX_ USE EXPDNENTIRL INTE6RRL RPPROXIMRION
TRU=TRU_TRUN_N_k}
C_ PROTECT RGRIN_T TRBLE OUERFLDH
T'(=_0o_TRU
ITY=TY+I.
:F(ITY,LT,&_ ITs=1
TDFCN=TE_(IT'(_+(TY-ITY+I)_TE_(ITY+I)-TE_(ITY))
_o TO zE5
TDFCN=O,
iF(,NOT,RERFLG) GO TO 130
C
Figure G.7 GISS Weather - LINKHO (Cont'd)
FMP FORTRAN
ELSE IF (NCC.GT,O) THEN
TDF(N) = 0.0
REF(N_ = 0.0
EUP_N) = BT_P(N)
EDN(N} = BTBP(N+I)
ELSE %F _ X.LT,I.E-_) _HEN
TDF(N)=£,0
REF(N) = 0,0
EUP(N) = 0.0
EDN(N) = U.0
ELSE
ZF (X ,LE. i_,0) THEN
EXTRU = EXP(-X)
ITY = XX_0. • £.
TDF(N) = TE3(ZTY) ÷ (TY-ITY+£) x (TE_(ZTY+Z)-TE_(ZT7_)
ELSE
EXTRU = 0.0
TDF(N) = 0.U
ENDZF
REF(N) = 0,0
XI = I,U - TDF(N_
X_ = ((£,0 - EXTRU)/X-TDF(N)) x _(BTDP(H_ - BT_P(H+I))_
0.6666)
EDN(N) = BTUP(Nel)AxI÷x_
EUP(N) = BTDP_N}_X£-_
EHD_V
DENO= 1.0/_1.U - RDNCN_REF(N))
EDNCN = (EDNCN÷EUP(N)_RDNCN) _ TDF(N_ X DENO÷ EON(N;
IF (NCC,GT,0) CLDFLG = ,TRUE.
IF(CLDFL_.RND,PID._E,I,UE-_) RERFL6=,TRUE,
_F (,NOT,(CLDFL_._R°RERFLG)) THEN
TRU = TRU + X
ZF _TRU .6T, zS) THEN
TDFCN = U.
ELSE IF ((_0,_TRU+Z.),LT,I_ THEN
ZTY=I
TDFCN = TK3(ITY)+_TY-ITY+I;A(TE3(ZTY+I)-TE3(ITY))
ENDZF
ENDIF
Figure G.7 GISS Weather - LINKHO (Cont'd)
Original FORTRAN
C
C
C
C     CASE 2. SIGNIFICANT ABSORPTION
RDNCN=REF_N)+TDF(N)_RDNCNATDF(N)/DEND
TDVCN=TDVCNRTDF(N)/DEND
IF <NCLDUD(N},EG,U,DR,PID,GE,I,K-4) GD TO 140
C     CASE 3. HEAVY CLOUD COVER
TDFCN=U,
RDNCt|_U,
TRU=O,
CD||TINUE
C     SAVE PARTIAL SUMS
EUPC(N}=£UPCN
EDNC<N)=EDNCN
TDFC_N}=TDFCt|
RDNC<N)=RDNC||
CONTINUE
C     ADDING GROUND LAYER NOT INCLUDED
C****
C**** FORM BOTTOM COMPOSITE LAYER (ADDITION)
C****
DO i1_ N=_,NG
H=NG÷I-N
DEN0 = /,U - RUPCN_REF(H)
EUPCN=EUP<H)*<EUPCN_EDN<H)XRUPCN;_TDF<H)/DEHD
IF<H.Ee,I) _O TD _19
L=H-_
RUPCN=REF(H)_TDF(H)ATDFfH)_RUPCN/DEND
DEND=/,U-RDNC<L;ARUPCN
PEFUP =_EUPCNCEDNC(L)ARUPCN)/DEND
PEFDN =_EDt|C(L)+KUPCNRRDNC(L))/DEND
GO TO 1_0
.19 PEFUP=EUPCN
PEFDN=U,
-_0 FE<N)=FE(N}+CKLRH_<_EFUP-PKFDN)
11_ CONTINUE
100 CDIITINUE
_00 CDNTINUE
c
C     SAVE STRATOSPHERIC FLUXES NOT INCLUDED
C
Figure G.7 GISS Weather - LINKHO (Cont'd)
FMP FORTRAN
IF (SERFLG) THEN
RDNCN = REV(H_ _ TDF(H)ATDF6H)_RDNCH_DEN0
TDFCN = TDFCNxTDF(N)XDENU
EHDIF
ZF(NCC,HKtU ,DR, P_O,LT,£,0E-_) THEN
TDFCN = O.0
RDNCH = 0,0
T_U = O_O
KHDIF
EUPC(H_ = EUPCH
EDNC<NS = EDIICt|
TDFC(NJ = TDFCN
RDNC(N_ = RDNCN
CONTINUE
DENS = _,U,'<_,U-RUPCN_REF(H))
EUPCH = EUP(H) * ((EDH(M)XRUPCN+EUPCH) x TDF(H)xDENU
_F (H,NE,Z) THEN
RUPCN = REF(H) ÷ (TDF<H)XTDF(H)_RUPCN_DENU)
L=H-i
DENU = I,/(Z,-RDHC(L)XRUPCH)
PEFUP = LEUPCN • EDNC(L)_RUPCN)_DEN0
PE_DN = _EUPCN * EDNC(L_RDNC(L))xDENU
ELSE
PEFUP = EUPCN
PEFDN : U,0
ENDIF
FE<H) = F'E(H) _ _PEFUP-PEFDN)xCLK_H
CD/|TIHUE
CDtITINUE
COIITIHUE
Figure G.7 GISS Weather - LINKHO (Cont'd)
APPENDIX H
CONNECTION NETWORK SIMULATION TOOLS
H.1 SUMMARY
Two computer-based tools were developed as an aid to the study of
the various Connection Networks. The first was a functional
simulator. This functional simulator supported evaluation of the
Benes, the single-layer Omega, and the simple double-layer Omega
networks described in Appendix B. The networks could be exercised
in a number of modes including random inputs and p-ordered inputs.
Section H.2 below discusses these capabilities in more detail.
The second tool was a stochastic analyzer (see Section H.3). This
tool used the probability of addresses occurring and the probability
of requests occurring within the network to predict blockage
within the network. Although this approach precluded actually
observing where specific blockages would occur for a particular
input situation, the tool was considered necessary because it
would be impractical to run all possible input combinations. The
stochastic analyzer was used to evaluate both the single-layer
Omega network and the double-layer network which included
inter-layer connectivities.
It was noted in Appendix B that both tools gave comparable results
when run on the same cases. This correspondence gave confidence
in the results obtained using these tools.
H.2 CONNECTION NETWORK FUNCTIONAL SIMULATOR
H.2.1 Model
The CN simulator is designed to simulate a CN in which requests
propagate through at the speed of transmission delay in cable and
combinatorial logic, after which the path is locked up for the
duration of the EM cycle. After an EM cycle, the nodes involved
in this path may be unlocked if they are not involved in another
EM request.
In addition to nets with the connectivity of Benes networks and
Omega networks (see Appendix B), there are options on the number
of redundant paths supplied. There can be twice as many ports on
the processor side as there are processors, or there can be just
512. The EM module ports can be spread across the entire 1024
ports on that side, or they can occupy the first 521. The simulator
is basically a 1024-wide network of 2 x 2 switches.
The number of CN-clock cycles per EM cycle time can be adjusted
from 1 to 9 by an input parameter.
Each simulated processor has a queue of up to six memory requests.
The Nth entry in this queue may be either a set of "S" random EM
module numbers, with 512-S of the processors having null requests,
or the entry may be a p-ordered or a p-q-ordered vector of EM
module numbers, with 512-S of the processors having their requests
nulled before the program starts. S is an input parameter ranging
from 0 to 300, or equal to 512.
The four-digit seed of the random number generator is included in
the set of input parameters.
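The random queue entry described above can be sketched as follows. This is only an illustration, not the simulator's own code: the function name, the use of Python's `random` module, and the figure of 521 EM modules (512 plus 9 spares, inferred from the port counts quoted above) are assumptions.

```python
import random

NUM_PROCESSORS = 512   # processors in the simulated system
NUM_EM_MODULES = 521   # assumption: 512 EM modules plus 9 spares

def random_queue_entry(s, seed):
    """Return one queue entry: 512 requests, with None marking a null request.

    Exactly s processors keep a random EM module number; the other
    512 - s requests are nulled, and the choice of survivors is random.
    """
    rng = random.Random(seed)  # seeded so that a run can be repeated
    entry = [rng.randrange(NUM_EM_MODULES) for _ in range(NUM_PROCESSORS)]
    keep = set(rng.sample(range(NUM_PROCESSORS), s))
    return [m if p in keep else None for p, m in enumerate(entry)]

entry = random_queue_entry(s=300, seed=1234)
active = sum(1 for m in entry if m is not None)
```

Seeding a private `random.Random` instance mirrors the role of the four-digit seed: the same seed reproduces the same set of requests.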
H.2.2 Simulator Controls
The input commands accepted by the functional simulator are listed
in Table H.1 below. Some of these inputs are optional and have
default values as indicated.
Table H.1
CN Functional Simulator Input Commands

Command            Description

Fn                 Type of network, where n is the sum of:
                     0: if a 19-level Benes network
                     1: if a 10-level single-layer Omega network
                     2: if a 10-level double-layer Omega network
                        with alternating priorities
                     4: if processor M is attached to input port 2M
                     8: if EM module N is attached to output port 2N
                        up to 511, with the other 9 attached to
                        output ports 1, 3, 5, ... 17.
                   (If no F command, F0 is used as default.)

An                 Algorithm to be used within each 2 x 2 node,
                   where n is:
                     0: the node gives priority, in case of conflict,
                        to the lower-numbered ("upper") input in all
                        one-sheet (single-layer) cases, and gives
                        priority to the higher-numbered ("lower")
                        input for the second sheet in a double-layer
                        Omega network.
                     1: the node sets a straight-through connection
                        in the case of conflict.
                     2: the node alternates the priorities between
                        upper and lower input ports on alternate CN
                        clocks. If this mode is chosen, it is
                        recommended that the number of CN clocks/EM
                        clock be odd (see Tn command below).
                   (If no A command, A0 is used as default.)

Snnn               Command which causes all but "nnn" of the 512
                   entries in each queue position in the processor
                   to be erased. The choice of which entries to
                   erase is random.
                   (If no S command, S512 is used as default. This
                   corresponds to no erasures.)

Tn                 Command which sets "n" cycles of the CN clock for
                   each EM access time.
                   (If no T command, T0 is used as default.)

BR                 An optional command that signals the
                   "bit-reversal" of processor number to CN port
                   number. That is, if BR, then proc. 1 goes to port
                   256, proc. 2 goes to port 128, proc. 3 goes to
                   port 384. That is, proc. 00000011 goes to port
                   11000000. Processor no. 00010111 goes to port no.
                   11101000, etc.

nnnn               A four-digit number sets the seed for the random
                   number generator.

Pnnnmmm            Sets a p-ordered vector into the next entry across
                   all processor queues. The entry has an offset of
                   "nnn" and a skip distance of "mmm".

Qaaassskkkxxxqqq   Sets a p-q-ordered vector into the next entry
                   across all processor queues. "aaa" is the offset
                   to the start of the first vector piece, "sss" is
                   the skip distance within pieces, "kkk" is the
                   length of each piece, "xxx" is the number of
                   elements omitted if the first piece is shorter
                   than kkk, and "qqq" is the skip between the end of
                   one piece and the beginning of the next.

R                  Sets a vector of 512 random requests into the next
                   entry in all the processor queues. (The seed for
                   the random number generator should precede this
                   command.)

Lnnn               This command imposes a limit on the number of CN
                   cycles through which the simulation will run.
                   Termination will be after "nnn+1" CN cycles.
                   (If no L command, L047 is used as default.)
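Two of the commands in Table H.1 lend themselves to a short sketch: the additive coding of the F command and the bit reversal selected by BR. The function names are hypothetical, and the 9-bit width is an assumption chosen because it reproduces the decimal examples in the table (processor 1 to port 256, 2 to 128, 3 to 384).

```python
NETWORK_TYPES = {
    0: "19-level Benes network",
    1: "10-level single-layer Omega network",
    2: "10-level double-layer Omega network, alternating priorities",
}

def decode_f(n):
    """Split the F-command sum into a network type and two attachment flags."""
    base = n & 3
    if base not in NETWORK_TYPES:
        raise ValueError("network type must be 0, 1 or 2")
    return {
        "network": NETWORK_TYPES[base],
        "processor_m_on_port_2m": bool(n & 4),
        "em_module_n_on_port_2n": bool(n & 8),
    }

def bit_reverse(value, width=9):
    """Reverse the low `width` bits of value (width=9 is an assumption)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

f14 = decode_f(14)
ports = [bit_reverse(p) for p in (1, 2, 3)]
```

Because the F options are powers of two, testing individual bits recovers each option from the sum without ambiguity.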
Warning: Although the input is free-form, in that the sequence of
commands does not matter and any number of intervening blanks is
allowed, each number must follow its command without any
intervening blank, and must consist of exactly the correct number
of digits.
Following all commands, any character (such as "X") that is not ?,
E, N, D, or the first character of any valid command will terminate
the input. The rest of the card can then be used for a comment,
which will be printed out on the first line of output.
H.2.3 Simulator Output
Figures H.1, H.2, and H.3 are examples of three typical CN Functional
Simulator outputs. These examples happen to use p-q-ordered
vectors as inputs with piece lengths of 31, 100, and 30.
The cases were taken from the explicit and implicit aero flow
code. Two of these cases are in mesh sizes as exhibited in the
listings supplied by NASA, and one of the cases exhibits the full
size.
The first line of the printout, which begins with "?END", prints
the input commands as previously described. For example, Figure
H.1 shows (on the first line):
T2              (2 CN clocks per EM access time)
BR              (Bit reversal of processor number to CN port number)
F14             (Double-layer Omega network with alternating
                priorities (2) + processor M attached to port 2M (4)
                + EM module N attached to port 2N (8))
Q047 1 31 1409  (p-q-ordered vector with:
                047  offset
                1    skip distance within pieces
                31   length of each piece
                1    number of elements omitted if the first piece
                     is shorter than 31
                409  skip between the end of one piece and the
                     beginning of the next)
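One reading of the Q parameters above can be sketched as a small generator. This is an interpretation, not the simulator's algorithm: it assumes that addresses wrap modulo 521 EM modules, that the first piece simply contains `omitted` fewer elements, and that the next piece starts `gap` addresses beyond the last element of the previous piece. The function and parameter names are hypothetical.

```python
def pq_ordered_vector(offset, skip, length, omitted, gap, count=512, modules=521):
    """Build `count` EM module numbers for a p-q-ordered request vector."""
    if not 0 <= omitted < length:
        raise ValueError("omitted must be smaller than the piece length")
    vector = []
    address = offset
    remaining = length - omitted   # the first piece may be short
    while len(vector) < count:
        vector.append(address % modules)
        remaining -= 1
        if remaining == 0:
            address += gap         # jump from the end of one piece to the next
            remaining = length
        else:
            address += skip        # step within the current piece
    return vector

# Parameters taken from the example command line: Q047 1 31 1409.
v = pq_ordered_vector(offset=47, skip=1, length=31, omitted=1, gap=409)
```

Under these assumptions the first piece covers addresses 47 through 76 and the second piece begins at 485.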
The next several lines of the output summarize the simulation
conditions specified by the input commands.
The remaining output, summarized below, is printed at each network
layer at each CN clock.
1st line: Number of items left in processor queue before beginning.
Does not include items picked up by bumping the
processor queue pointers.
[Figures H.1, H.2, and H.3: CN Functional Simulator output listings]
2d and 3d line: Report of distribution of EM conflicts (pileups):
(number of pileups) x (length of pileup), for lengths of 0
through 15. Any EM module with a pileup of 10 or more will
have a line stating its module number and the size of the
pileup.
4th line: "On the nth cycle"
5th line: "There were sss successes in rrr requests". For version
A, the number of successes listed in the first report is
for the first layer; the number of successes listed in the
second report for the nth cycle is the total for both layers.
Next 32 lines: 512 entries, one for each processor. Each
entry shows "-" if no request was made; otherwise it shows the EM
module number of the request, prefixed by "*" if the request
was granted, or by "EM" if the EM cycle is still running, so the
path is locked up.
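The pileup report on the 2d and 3d lines can be approximated with a simple tally. This is a sketch under assumptions: a "pileup of length k" is taken to mean an EM module addressed by k simultaneous requests, 521 modules are assumed, and the function name is hypothetical.

```python
from collections import Counter

def pileup_report(requests, modules=521, max_length=15):
    """Tally how many EM modules receive 0, 1, ..., max_length requests.

    `requests` holds one EM module number per processor, or None for a
    null request. Returns the distribution and the heavily loaded modules.
    """
    per_module = Counter(m for m in requests if m is not None)
    distribution = [0] * (max_length + 1)
    for module in range(modules):
        # Counts above max_length are folded into the last bucket.
        distribution[min(per_module.get(module, 0), max_length)] += 1
    # Modules with a pileup of 10 or more are listed individually.
    heavy = sorted((m, n) for m, n in per_module.items() if n >= 10)
    return distribution, heavy

sample = [5] * 12 + [7, 7, None, 3]
distribution, heavy = pileup_report(sample)
```

In the sample, module 5 receives twelve requests and is reported individually, while modules 7 and 3 appear only in the distribution.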
H.3 CONNECTION NETWORK STOCHASTIC ANALYZER
The Connection Network Stochastic Analyzer is used to compute the
probabilities of input, output and blockage for each switch across
the connection network (CN). These computations are then used to
determine blockage at each level, and finally to determine total
blockage. This tool was not developed to test the performance of
the CN under specific conditions. Rather, the question raised was
what the effect of the CN would be on the average. An initial
assumption was made that the inputs to be evaluated would be
random permutations of the destination addresses. Under this
assumption, no blockage would occur due to simultaneous reference
to the same destination. Although simultaneous references will
occur in practice, including them would obscure the effect of
the network itself. The functional simulator did allow consideration
of such simultaneous reference situations.
H.3.1 Model
The Stochastic Analyzer was implemented to study the single-layer
Omega network and the double-layer Omega network with interlayer
paths at each node. An example of such a network is shown in
Figure H.4. In this figure 8 processors and 11 memories are
connected. For the purposes of this model, the 11 extended memory
modules are "spread" as evenly as possible across the output ports
of the net (i.e., with 16/11 steps between each connection). This
mapping should be equivalent (although it is not the same) to some
of the mappings discussed in Appendix B.
H.3.1.1 Input Probabilities
Since only random permutations are considered, each destination
port address (for those ports with memory modules attached) occurs
with equal probability. As pointed out earlier, there may not be
as many memory modules connected as there are output ports on the
network. As a result, the probability that a specific bit in the
destination address = 0 or 1 is likely to vary from bit to bit.
Figure H.4 Omega Network with 8 Processors and 11
Extended Memory Modules
For example, in Figure H.4, the probability that the high-order
bit of the destination address = 1 is 5/11 and the probability
that that bit = 0 is 6/11. This probability affects the probability
of an address occurring on one or the other output of a
switch.
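The 5/11 and 6/11 figures can be reproduced with a short calculation. The sketch assumes a particular even spreading, with module k on output port floor(16k/11); the report does not give the analyzer's exact mapping, so this placement and the function names are hypothetical.

```python
from fractions import Fraction

def spread_ports(modules, ports):
    """Assumed mapping: module k sits on output port floor(k * ports / modules)."""
    return [k * ports // modules for k in range(modules)]

def bit_probabilities(port_list, width):
    """P(bit = 1) per address bit, high-order bit first, for equally likely ports."""
    total = len(port_list)
    return [
        Fraction(sum((p >> (width - 1 - b)) & 1 for p in port_list), total)
        for b in range(width)
    ]

ports = spread_ports(11, 16)          # 11 EM modules on a 16-port network
probs = bit_probabilities(ports, 4)   # 4 address bits select one of 16 ports
```

With this placement the occupied ports are 0, 1, 2, 4, 5, 7, 8, 10, 11, 13 and 14, so five of the eleven destinations have the high-order bit set.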
H.3.1.2 probability Computations
The computations performed by the analyzer are based on the
probability of occurrence of each possible input combination to
each switch or node in the network. For example, consider the
switch marked A in Figure H.4. For this example, assume that the
network is a single-layer Omega network. The probability of
blockage in that switch is the probability that the inputs are
either both 0 or both 1 simultaneously. If P(INPUT) is the proba-
bility of an input request occurring and if P(0-BIT), P(1-BIT),
P(2-BIT), P(3-BIT) are the probabilities of the high order through
low-order bits of the destination address, then the upper input to
A exists with the probability:
P(A-UPPER) = P(INPUT) x P(0-BIT=1) (H.1)
Similarly, for the lower input
P(A-LOWER) = P(INPUT) x P(0-BIT=1) (H.2)
Then, the probability of blockage in switch A can now be deter-
mined.
P(A-UPPER=1) = P(INPUT) x P(0-BIT=1) x P(1-BIT=1) (H.3)
P(A-UPPER=0) = P(INPUT) x P(0-BIT=1) x P(1-BIT=0) (H.4)
P(A-LOWER=1) = P(INPUT) x P(0-BIT=1) x P(1-BIT=1) (H.5)
P(A-LOWER=0) = P(INPUT) x P(0-BIT=1) x P(1-BIT=0) (H.6)
P(BLOCK-IN-A) = P(A-UPPER=1) x P(A-LOWER=1) +
P(A-UPPER=0) x P(A-LOWER=0) (H.7)
Substituting known values:
P(INPUT) = 1 (assume all inputs active)
P(0-BIT=1) = 5/11
P(1-BIT=0) = 6/11
P(1-BIT=1) = 5/11
Then:
P(BLOCK-IN-A) = (5/11 x 5/11) x (5/11 x 5/11) + (5/11 x 6/11)
x (5/11 x 6/11) = .104
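The arithmetic of equations (H.1) through (H.7) can be checked with a short sketch. The function and argument names below are illustrative only and are not taken from the analyzer itself; exact fractions are used so that the rounding to .104 is visible.

```python
from fractions import Fraction

def block_probability(p_input, p_bit0_is_1, p_bit1_is_1):
    """Blockage probability for one switch, following (H.1)-(H.7)."""
    p_bit1_is_0 = 1 - p_bit1_is_1
    # (H.1), (H.2): a request is present on the upper / lower input of A.
    p_upper = p_input * p_bit0_is_1
    p_lower = p_input * p_bit0_is_1
    # (H.3)-(H.6): a request is present and its next control bit is 1 or 0.
    upper_1 = p_upper * p_bit1_is_1
    upper_0 = p_upper * p_bit1_is_0
    lower_1 = p_lower * p_bit1_is_1
    lower_0 = p_lower * p_bit1_is_0
    # (H.7): both inputs ask for the same output of the switch.
    return upper_1 * lower_1 + upper_0 * lower_0

# Values substituted in the text: P(INPUT) = 1, P(0-BIT=1) = P(1-BIT=1) = 5/11.
p_block = block_probability(Fraction(1), Fraction(5, 11), Fraction(5, 11))
print(p_block, round(float(p_block), 3))
```

With the values used above the exact result is 1525/14641, which rounds to the .104 quoted in the text.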
Using similar techniques, the probability of an address occurring
on each of the outputs of switch A can be determined. This sort of
computation can then be carried on through the network, taking into
account the probability of blockages and the probability of the
corresponding address control bit.
H.3.2 Analyzer Controls
The user inputs the number of processors and memory modules in the
system as well as the number of switch levels (up to 10), the
number of input connection points, the number of active processors
(at most the total number of processors), and the number of layers
(1 or 2) in the network. Using this information the analyzer builds
a table
representing the connection network. This table provides for
processors to be mapped onto input ports, outputs from one level
mapped onto inputs of the next level, and switch outputs mapped
onto memory modules. Each switch's input probability is used to
compute its own output and blockage probabilities.
H.3.3 Analyzer Output
When the calculations for each switch are completed, a listing is
prepared which fully describes the network analyzed. All of the
user-input information is printed as well as processor and memory
module mappings. Total blockage and blockage for each level are
printed, as well as each switch's output probabilities.
Figure H.5 shows an example of an additional output which summar-
izes the results of a number of runs. The output for each run
specifies the number of processors, of memory modules and of ports
in the network being evaluated. The number of active processors
identifies the average number of processors actively presenting
requests to the network. Cumulative blockage probability is the
probability that any request made is blocked somewhere within the
network. The number of inputs per switch identifies which type of
network was run. A 2-input switch is used on the single-layer
Omega network. A 4-input switch is used on the double-layer Omega
network with interlayer communication. The line identified as
Probability of Blockage summarizes the cumulative blockage at each
level through the network from the processors (on the left) to the
memory modules (on the right).
[Figure: example of the analyzer summary output for several runs; the listing is illegible in the source scan.]
APPENDIX I
BENES AND OMEGA NETWORKS FOR FLOW MODEL PROCESSING*
INTRODUCTION
Parallel processing machines gain time at the expense of addition-
al processing elements. However, parallelism entails processor
access problems. The major assumptions of the NASF Flow Model
Processor are:
1) There are 512 processing elements and 521 extended memory
modules.
2) Some hybrid of a Benes or Omega network is used to connect
processor elements to EM modules and processing elements to
processing elements (see Figures I.1A and I.1B).
Roughly, the more processing elements, the faster the machine can
run, given a program which exhibits a large degree of parallelism.
If there is a prime number of memory modules -- 521 is prime--then
corresponding column elements of a p-ordered vector are stored in
different extended memory modules, making it particularly easy to
access a column at a time (see Figure I.2). However, in assuming
521 EM modules, we presume that matrices are to be stored across
the EM's. It may be beneficial to be a slight bit heretical and
ask whether matrices stacked into a single EM might not be more
effective in executing block transfers to local memory. It is to
be remembered that a single processor will most often want, say
VECT(I), VECT(I+1), and VECT(I-1), which may be stored
concurrently in local memory. This, however, seems to be mainly a
software problem.
The choice of a Benes or an Omega network is a pragmatic one based
on required hardware and expected transmission time. (See the
chart on pg. 109 of Ref. 1). Ultimately, we settle for Benes and
Omega networks because they appear to be the most efficacious
solution presently available. While Benes (2, 3, 4) has shown
that for the network which bears his name, there exists a non-
blocking control pattern for every arrangement of inputs to
outputs, practically speaking, computation time is prohibitive.
Thus, the concept of distributed control arises; this concept
works especially well with an Omega--since at the ith level in the
network, there is a relatively simple mapping between the ith most
significant bits of two or more addresses and the state of the
switch.
*Originally submitted in September, 1978.
Figure I.1A
Benes Network (n=4)
Figure I.1B
Omega Network (n=4)
Figure I.2
Storing a p=5 Matrix in a Prime Number of EM Modules
I.2 ANALYSIS OF TWO-LEVEL OMEGA NETWORK WITH INTER-LAYER
CONNECTIONS
The object of the two-layer Omega node-level analysis is to obtain
the blocking probability at each level in the network from its
input probabilities. The output probabilities can then be calcu-
lated from the input probabilities and blocking that goes on at
each switch. These outputs are then re-ordered by the connectiv-
ity of the network, and they become the inputs for the next level.
An important fact in this analysis is that each switch in a given
level has the same set of input probabilities. Thus, the probabil-
ity of a block at one switch becomes the probability of blocks at
N switches. We assume an unpacked Omega (with N processors
attached to 2N input ports), so that the inputs to level one are
all at the A-port of the first layer node (Figure I.3). There
are then two possible inputs since it is equally likely that the
address bit will be a one or a zero. These bits determine the
switching operations performed by the node on the address under
the switching rules. It is clear that on the first level of an
unpacked network, there will be no blocks, and that, furthermore,
the second layer is not used. The topology of the network implies
that there are nine possible input combinations to the second
level, each of which has an associated probability. On the second
level, the fact that IA= address, IB = address (where IA and IB
are inputs A and B) is now a possible combination implies that the
second layer is now used, although there are still no blocks on
this layer. There are 49 possible input combinations to the third
level. Blockage is now possible since there can be three inputs
to one two-layer switch pair. On the fourth, and all subsequent
levels, there are 81 input types. This is basically base three in
four places where the three characters 0, 1 and blank are permuted
over IA, IB, IA' and IB'. Table 1.4 gives the input, output and
blocking probabilities for these first four levels done in the
hand simulation.
There are two concepts which should be understood concerning the
evaluation of the network through each succeeding stage. They are
a) increasing randomness, and b) decreasing density. While initi-
ally most of the addresses are on the lower layer, conflicts on
the lower layer tend to send more addresses to the upper layer.
In equilibrium, both layers will be equally occupied. Now a nec-
essary condition for a block in the network is that there be two
inputs on one layer, and one on the other. One might think,
therefore, that maximum blocking will occur when the first layer
has twice as many addresses as the second layer, but since
blocking is symmetric between layers, maximum blocking is expected
to occur when the two layers are equally dense, i.e. when the
system is completely randomized. On the other hand, the fact that
blocking implies a decreased density of addresses as the addresses
are blocked means that the number of blocks should decrease as
requests progress through the network. However, since the actual
number of blocks is small, this effect will be small at first, but
will become more important as the system tends towards equilib-
rium. Thus, we would expect the blocking probability to increase
initially due to the effect of randomization, and then to decrease
due to the effect of decreasing density. However, it is not clear
on which level the turning point will occur.
Figure I.3
A Node of Level 1
When it was realized that more blocks would occur on some levels
than on others, a search began for the best way of adding a small
amount of hardware to improve performance. First, it was felt
that one would probably not want to take any levels off of the
second layer (with the exception perhaps of the first level), since
the loss in efficiency at that level would be greater than any
"marginal gains" at any point in a third layer. Secondly, it is
clear that once a new layer is initiated, it must be continued to
the destination ports if the addresses are not to be injected back
into the lower layers.
This led to the concept of the three-layer network by asking how
such a system might "grow". Indeed, there are some similarities
between the transposition network and the corpus callosum (which
unites the two hemispheres) of the human brain. However, it seems
somewhat deceptive to think in terms of layers, for each switch
pair may be reduced to a planar circuit with, say, four inputs
being mapped to four outputs. As described in Chapter 5, each of
these input and output sets is composed of twelve or more wires,
at least nine of which control the switch settings for the various
levels of the network; the other three or more wires may play
special parts in the local control of the switch. The frames of
data may follow the 'net-code' through the network to be stored in
buffers at the terminal end.
I.3 SKIP DISTANCE ANALYSIS
When a p-ordered vector is stored across extended memory, cor-
responding column elements are stored in modules (o + pi) mod 521 as
shown in Figure I.2, where o is the offset and i is the row
number. When each processor gets a succeeding row element, i
becomes the processor number. This is particularly important in
lock-step operation, but is also relevant in the early stages of
any loop.
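As a small illustration of the storage rule (o + pi) mod 521, the sketch below lists the module addressed by each of the 512 processors. The function names are hypothetical; only the modulus, the offset o, the skip p and the processor number i come from the text.

```python
NUM_MODULES = 521  # prime number of extended memory modules

def module_for(offset, skip, i, modules=NUM_MODULES):
    """Module holding the element fetched by processor i: (o + p*i) mod 521."""
    return (offset + skip * i) % modules

def modules_touched(offset, skip, processors=512, modules=NUM_MODULES):
    """Modules addressed when processors 0..511 each fetch one element."""
    return [module_for(offset, skip, i, modules) for i in range(processors)]

# Because 521 is prime, any skip that is not a multiple of 521 sends the
# 512 processors to 512 different modules.
print(len(set(modules_touched(offset=0, skip=5))))
# A skip of 521 behaves like a skip of zero: every request hits one module.
print(set(modules_touched(offset=3, skip=521)))
```

Distinct modules do not by themselves guarantee freedom from blocking inside the network; that depends on the paths, which is the subject of the tables that follow.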
Results of hand-simulations which were performed for 8 x 11 un-
packed Benes and Omega networks are summarized in Tables I.1, I.2
and I.3. It becomes clear from these charts that these networks
are symmetric with respect to skip distance, i.e. there is a
correspondence:
skip 1 to skip 10
skip 2 to skip 9
skip 3 to skip 8
skip 4 to skip 7
skip 5 to skip 6
                OaOa            ObOb            Blocking
Level 1
  **            1/2             1/2
  A*            1/2             1/2
  *A            0               0
  AA            0               0               0%
Level 2
  **            9/16            9/16
  A*            6/16            6/16
  *A            0               0
  AA            1/16            1/16            0%
Level 3
  **            9604/16384      9604/16384
  A*            5096/16384      5096/16384
  *A            392/16384       392/16384
  AA            1292/16384      1292/16384      1.46% (7.47 blocks)
Level 4         Not completed                   1.79% (8.99 blocks)
Table I.1
Summary of Node-Level
Hand Analysis
Table I.2
Skip Distance Analysis for OMEGA Network
                         Offset
  Skip    0  1  2  3  4  5  6  7  8  9 10    Ave.
    1     4  1  4  1  4  1  3  2  2  3  1    2.6
    2     0  1  0  0  0  1  0  0  1  1  0    0.4
    3     0  1  2  2  2  2  2  2  2  2  1    1.8
    4     2  1  3  1  1  1  2  2  3  1  1    1.8
    5     0  0  0  0  0  0  0  0  0  0  0    0
    6     0  0  0  0  0  0  0  0  0  0  0    0
 *  7     1  2  2  3  1  1  2  1  3  1  1    1.8
 *  8     0  1  2  2  2  2  2  2  2  2  1    1.8
 *  9     0  1  0  0  0  1  0  0  1  1  0    0.4
   10     3  2  3  2  1  4  1  4  0  4  0    2.4
*Assured by Symmetry
Table I.3
Skip Distance Analysis for BENES Network
                         Offset
  Skip    0  1  2  3  4  5  6  7  8  9 10    Ave.
    1     0  0  0  0  0  0  0  0  0  0  0    0
    2     0  2  1  3  2  2  0  1  0  0  1    1.1
    3     0  0  0  0  0  0  0  0  0  0  0    0
    4     1  1  1  1  1  1  2  1  2  1  2    1.3
    5     0  0  0  0  1  1  0  1  1  0  0    0.4
    6     0  0  1  1  0  0  0  0  0  0  2    0.4
    7     1  2  1  2  1  2  1  1  1  1  1    1.3
    8     0  0  0  0  0  0  0  0  0  0  0    0
    9     2  0  2  2  3  1  2  0  1  0  1    1.3
   10     0  0  0  0  0  0  0  0  0  0  0    0
I.4 TOWARD A GENERAL ANALYSIS OF TRANSPOSITION NETWORKS
In its most abstract formulation, a system such as a transposition
network can be described in terms of its states in a stochastic
process. In an unpacked one layer Omega network, there are $n 2^n$
switches, where n is the number of levels. Each switch can occupy
one of nine possible states, two of which are blocking, and seven
non-blocking. Much like a system of N pennies, which can take on
$2^N$ different states, an n-leveled transposition network can take
on $9^{n 2^n}$ states of which $7^{n 2^n}$ are non-blocking. These very
large numbers give us every possible combination of switch configur-
ations which the network can occupy.
One problem with such an analysis [2] is that not every state
corresponds to a physically realizable configuration. In
particular, there will be states for which no continuous path can
be drawn.
One might then come to take the path rather than the state of a
set of switches, as our unit of analysis. In an Omega network
there is one and only one path by which any given input can reach
any given output. (In a Benes, there are $2^n$ such paths, one for
each switch from the middle level.) Now assume that there are $2^n$
inputs -- one for each switch -- and $2^n + r$ outputs -- where r
includes additional outputs plus one output corresponding to a
null request. Then there are order $2^{2n}$ states in the sample
space, each described by $2^n$ input-output pairs.
The problem then becomes one of obtaining the blocking probability
for each of these states. This must involve the structure of the
network itself. One can note, however, that blocking in an Omega
is a function of input pairs, for on any level only two inputs may
share the same switch. A mathematical algorithm, for determining
whether any given input pair results in a block is given in the
following section, Part B. It is noted here that such an algor-
ithm requires, at most, a comparison of each of $2^n$ input-output
pairs for each of n levels. Thus, for order $2^{2n}$ states, there are
order $n 2^{4n}$ or $N^4 \log_2 N$ comparisons that must be made to
completely determine the blocking probabilities for all possible
states.
This number may well be dishearteningly large for practical
results, even if it need be done only once for simulation
purposes. Says Benes [1]:
In most congestion problems, it is easy enough to construct
(say) a Markov process that is a probabilistic model of the
system of interest. But it is difficult, because of the
large number of states and complexity of the structure, to
obtain either analytic results or fast reliable procedures.
This circumstance has been a major obstacle to progress in the
congestion theory of large systems. One of its consequences
has been that in some cases, models known to be poor rep-
resentations of systems have been used merely because they
were mathematically amenable, and no other tractable models
were available. (pp. 1216-1217)
In another place he talks of possible "equivalence relations"
between simple models and more complex ones.
The following is actually a model for determining the probability
that x random assignments from N inputs to N outputs will be
unique. The first input may choose any of the N outputs. The
second input has an (N-1)/N probability of choosing one of the
empty ones. The ith input has an (N-i+1)/N chance of choosing one
of the empty ones. For x random assignments, the probability is
$$E(x \text{ in } N) = \prod_{i=1}^{x} \frac{N-i+1}{N} = \frac{N!}{(N-x)!\,N^{x}}$$
that such a mapping will be unique.
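The product described above is easy to evaluate directly. The sketch below is only an illustration of that product; the helper that searches for the point where the probability falls below 1/2 is a hypothetical convenience suggested by the remark about solving for an equilibrium value, not a procedure given in the report.

```python
from fractions import Fraction

def unique_probability(x, n_ports):
    """Probability that x random assignments onto n_ports outputs are all distinct."""
    p = Fraction(1)
    for i in range(1, x + 1):
        # The ith input finds an empty output with probability (N - i + 1) / N.
        p *= Fraction(n_ports - i + 1, n_ports)
    return p

def largest_x_at_or_above(n_ports, threshold=Fraction(1, 2)):
    """Largest x for which the uniqueness probability is still >= threshold."""
    x = 0
    while x < n_ports and unique_probability(x + 1, n_ports) >= threshold:
        x += 1
    return x

for x in range(1, 5):
    print(x, unique_probability(x, 8))
print(largest_x_at_or_above(8))
```

For N = 8 the probabilities for x = 1, 2, 3, 4 are 1, 7/8, 21/32 and 105/256, so the probability first drops below 1/2 at the fourth assignment.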
The above formula is expected to be related to the probability of
obtaining z successes across N ports in a packed Omega network.
(This suspicion is based on the fact that there is one and only
one path for each input-output pair.) For small x, this function
is presumed to increase linearly, but for larger x and z, z seems
to increase more slowly than x. Qualitatively, unpacking the
network corresponds to increasing N, which increases E(z in N).
To find the expected number of successes in an equilibrium condi-
tion, set E(z in N) equal to 1/2 and solve for z. However, for a
more exact and more complicated procedure for obtaining this
result, see Section I.7.
I.5 PERMUTATION GROUPS AND PARTITION SETS
Benes's proof of the fact that a network of 2N log2 N switches is
sufficient to ensure the rearrangeability of N inputs to N outputs
was published in 1964 [3]. This article draws heavily on group
theory and the concept of the partition of the set (1, 2, 3, ...,
N). The partition of a set is a finite collection of disjoint
sets whose union is the given set.
A. Consider storing the Benes transposition network of 2n-1
levels as a matrix. (Storing the n-level Omega network is a
special case of this.) On the first row, store the vector (0, 1,
2, ..., N-1). On the second row, store the vector (0,2,1,3,4,6,
...,N-1), taking the first two even numbers, then the first two
odd numbers, then the next two even numbers, until all N elements
of the vector are stored. On the ith row, for i less than n,
store the first $2^{i-1}$ even elements, then the first $2^{i-1}$ odd
elements and alternate until all the elements of the set (0, 1, ...,
N-1) are used up. For the nth, and middle row, store the $2^{n-1}$
even elements, then the $2^{n-1}$ odd elements. (The first half of the
N=16 Benes network is shown in Table I.5.) For row i between n
and 2n-1, store just as the (2n-i)th row. Now to compute the path
that a given address would follow in the absence of other
addresses, adopt the following procedure.
BENES (F12)
  Skip   Offset   Successes   Options    Comments
    2       3        251
    2       4        246
    3       4        512                 Magic
    3       5        512                 Magic
    5       6        413
    5     357        382
    8     357        139
  128     357        230
  128     357        266
  256     357        233
  518     357        512                 Magic
  520     357        512                 Magic
OMEGA (F13)
  Skip   Offset   Successes   Options    Comments
   ..      ..        199      R          Seed=0013
   ..      ..        211      BR & 2     Seed=0013
    1       0         32                 Worst
    1       1         36
    2       0         63
    3       0         85
    4       0         78
   13     357        111
  128     357        279
  128     357        259
  210     357        307
  260       0         79
[The option BR appears twice more in the source, against rows that cannot be identified.]
Note: Standard deviation of N count is sqrt(N)
Table I.4
Simulation of Skip Distances
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 0  2  1  3  4  6  5  7  8 10  9 11 12 14 13 15
 0  2  4  6  1  3  5  7  8 10 12 14  9 11 13 15
 0  2  4  6  8 10 12 14  1  3  5  7  9 11 13 15
Table I.5
First Half of Stored Benes
Experiments on the CN Simulator confirm this hypothesis. However,
it is not intuitively clear why this is the case, nor is a strict
correspondence between offsets obvious. In part, the answer lies
in the fact that to each skip distance there corresponds a cyclic
ordered set of permutations on the outputs (0, 2, 4, 6, 8, 10, 12,
14, 1, 3, 5). This set is itself the correspondence set for skip
1.
For skip 5, the ordered set is (0, 10, 5, 8, 3, 6, 1, 4, 14, 2,
12). For skip 6, the ordered set is (0, 12, 2, 14, 4, 1, 6, 3, 8,
5, 10). These sets are the same, save that they are oppositely
ordered.
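The relation between these ordered sets can be reproduced with a few lines. The sketch assumes, as one reading of the passage, that the set for skip s is obtained by stepping cyclically through the skip-1 set with stride s; that assumption is not stated in the text, but it regenerates the sets quoted for skips 5 and 6.

```python
# Correspondence set for skip 1 on the 8 x 11 network, as quoted in the text.
SKIP1_SET = [0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5]

def skip_set(skip, base=SKIP1_SET):
    """Ordered output set for a given skip: every skip-th entry of the base set, cyclically."""
    m = len(base)
    return [base[(skip * k) % m] for k in range(m)]

skip5 = skip_set(5)
skip6 = skip_set(6)
print(skip5)
print(skip6)
# Skips s and 11 - s visit the same outputs in opposite cyclic order.
print(skip6 == [skip5[0]] + skip5[:0:-1])
```

The same reversal holds for the other pairs in the correspondence (1 and 10, 2 and 9, 3 and 8, 4 and 7), since a stride of 11 - s modulo 11 is a stride of -s.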
A natural question which arises in this analysis is whether there
are any skip distances which are particularly bad. Of course, the
very worst case will be a skip of 521--which corresponds to a skip
of zero--in which case all the processors will attempt to access
the same memory module. Other than this, and this seems to be a
rather important fact, the greatest number of blocks occurs in an
8 x 11 Omega for skip distances of one, especially those with
small even offsets. Table 1.2 verifies this. Furthermore, for
all the trials which have been run on the simulator, skip = 1,
offset = 0 was the worst, with only 32 successes in 512 trials. The
reason for this is as follows: the second level of an unpacked Omega
will account for blocking of half the inputs if inputs from
adjacent nodes wish to access the same quadrant of the network.
Similarly, if adjacent nodes on the next level wish to access the
same octant of the network, half this number will be blocked.
This halving process, as the addresses are "funneled together",
continues until they are half-way through the network, at which
point they are "funneled back out" to their separate outputs. For
odd offsets, the funneling process does not begin until the third
level. For larger offsets, the mod 521 configuration of the
unpacked Omega tends to randomize the pattern.
It is not yet clear just what the overall relation between block-
ages and skip distances actually is; largely this problem is
irrelevant. It could be solved empirically by running, say two
hundred simulations picked from the skip distance range (0,260)
for offsets (0,1) and plotting a curve. (Results from a few
selected simulations are offered in Table 1.4.) One would expect
some kind of periodicity. But in fact, every such experiment
which has been run for an Omega network has resulted in a success
rate less than that for random requests (although for Benes net-
works skips 1 and 3 are "magic"), and significantly, it seems that
the bit reversal procedure used for mapping (see Appendix B for
details) is tantamount to a 'pseudo-randomization' of sorts. If
this randomization is hardwired, no skip distance should be
particularly bad.
1. Initialize the input node, column j.
2. If the control bit for the ith level calls for "go straight",
then the node for level i + 1 will be found on row i+l,
column j.
If the control bit calls for a "go across" and
a. j is even, then the node for level i+l will be found
on row i+l, column j+l, or
b. j is odd, then the node for level i+l will be found
on row i+l, column j-l.
3. Let j=node number and increase i.
4. If i is less than 2n, go to 2.
The value of the node number for each i describes the path taken
by the given address.
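The stored matrix of Part A and the column rule of step 2 can be sketched as follows. The function names are illustrative. Step 3 of the procedure (how the node number feeds back into j) is not modeled, since the text leaves it open to more than one reading.

```python
def stored_row(i, n):
    """Row i (1-based, 1 <= i <= 2n-1) of the stored Benes matrix for N = 2**n."""
    size = 2 ** n
    k = i if i <= n else 2 * n - i   # rows past the middle repeat row 2n-i
    block = 2 ** (k - 1)             # evens and odds are taken in runs of 2**(k-1)
    evens = list(range(0, size, 2))
    odds = list(range(1, size, 2))
    row = []
    for start in range(0, size // 2, block):
        row += evens[start:start + block] + odds[start:start + block]
    return row

def next_column(j, go_across):
    """Step 2: stay in column j for 'go straight'; otherwise move to the paired column."""
    if not go_across:
        return j
    return j + 1 if j % 2 == 0 else j - 1

# First half of the N = 16 network, to compare with Table I.5.
for i in range(1, 5):
    print(stored_row(i, 4))
```

For n = 4 the four printed rows agree with Table I.5, and rows 5 through 7 repeat rows 3, 2 and 1.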
B. Suppose one wishes to know whether two given input-output pairs
u-w and v-x result in a block for an Omega network. Let U be the
set (0, 1, 2, ..., N-1). Let i be the level number for i=0, ...,
n. Partition U sequentially into $2^n/2^i$ 2-by-$2^{i-1}$ matrices.
Call these $I^i_j$, one for each of the $2^{n-i}$ values of j. These
are the input matrices. Also partition U into $2^n/2^{n-i}$
$2^{n-i}$-by-1 matrices. Call these $O^i_k$, one for each of the
$2^i$ values of k. These are the output matrices. Then for all
(u, v, w, x), if there exists a j such that u is an element of
$I^i_j$ and v is an element of $I^i_j$ and there exists a k such
that w is an element of $O^i_k$ and x is an element of $O^i_k$,
then u-w blocks v-x by stage i.
Symbolically, this condition can be written
$$\exists\, j, k:\; u, v \in I^i_j \;\wedge\; w, x \in O^i_k \;\Rightarrow\; (u\text{-}w) \text{ blocks } (v\text{-}x) \text{ by level } i$$
The partition sets for n=4 are given in Figure I.4. For example,
note that 4-8 blocks 7-11 since 4 and 7 are elements of the same
input set (4 5 6 7) for i=2 and 8 and 11 are elements of the same
output set (8,9,10,11) for i=2. So 4-8 blocks 7-11 by level 2.
Note also that i is not unique but is satisfied for any i greater
than some minimum i. To make i equal to this minimum i, require
that u and v be from different columns of $I^i_j$.
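The partition test of Part B reduces to two shift comparisons when N is a power of two. The sketch below assumes the sequential partitions described above (input sets of 2**i consecutive numbers, output sets of 2**(n-i) consecutive numbers); the function names are hypothetical.

```python
def same_input_set(u, v, i):
    """True if u and v fall in the same level-i input set of 2**i consecutive inputs."""
    return (u >> i) == (v >> i)

def same_output_set(w, x, i, n):
    """True if w and x fall in the same level-i output set of 2**(n-i) consecutive outputs."""
    return (w >> (n - i)) == (x >> (n - i))

def blocks_by_level(u, w, v, x, i, n):
    """Partition condition for the pairs u-w and v-x at level i."""
    return same_input_set(u, v, i) and same_output_set(w, x, i, n)

def first_blocking_level(u, w, v, x, n):
    """Smallest level 1..n at which the partition condition holds, or None."""
    for i in range(1, n + 1):
        if blocks_by_level(u, w, v, x, i, n):
            return i
    return None

# Example from the text (n = 4): 4-8 blocks 7-11 by level 2.
print(first_blocking_level(4, 8, 7, 11, 4))
```

At level 1 the inputs 4 and 7 lie in different input sets, so the condition first holds at level 2, in agreement with the example.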
This can be proven by considering, for the ith level, the ith most
significant bit. If the ith most significant bits of the two
inputs to any switch are the same, i.e.
XXXX...0_i..._ _ _      XXXX...1_i..._ _ _
XXXX...0_i..._ _ _  or  XXXX...1_i..._ _ _
where X's and _'s represent bits that may assume any combination
of 1's and 0's, then there will be a block. Bits which are more
significant than i can occur in all possible combinations, but
these bits determine which inputs the addresses could have come
from. Thus a pairing between a set of output addresses and a set
of input addresses can be formed. While this is not a formal
proof, this result can be shown combinatorially by enumeration.
i = 0   I: (0) (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)
           (12) (13) (14) (15)
        O: (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
i = 1   I: (0 1) (2 3) (4 5) (6 7) (8 9) (10 11) (12 13) (14 15)
        O: (0,1,2,3,4,5,6,7) (8,9,10,11,12,13,14,15)
i = 2   I: (0 1 2 3) (4 5 6 7) (8 9 10 11) (12 13 14 15)
        O: (0,1,2,3) (4,5,6,7) (8,9,10,11) (12,13,14,15)
i = 3   I: (0 1 2 3 4 5 6 7) (8 9 10 11 12 13 14 15)
        O: (0,1) (2,3) (4,5) (6,7) (8,9) (10,11) (12,13) (14,15)
Figure I.4
Partition Sets for n=4
C. Suppose one wishes to know the mean probability that two
random addresses will result in a block. This is a function of a
relation between two inputs which is called their 'distance'. Con-
sider input 0. There is a 1/2 probability that it will be blocked
by input 1 since this blocking occurs on the first level. There
is a 1/4 probability that it will be blocked by inputs 2 or 3;
this would occur on the second level. For inputs 4, 5, 6 and 7
the probability is 1/8 since the level is the third. In
general, the distance for input 0 is the level on which the two
inputs could block, so call it i. Then the probability that the
inputs will block on level i is $(1/2)^i$.
Now assumedly there is a function g(x,y) of any two input numbers
(x,y) such that i = g(x,y). Then taking the average value of the
function $f(x,y) = (1/2)^{g(x,y)}$, the probability of a block is
obtained. But for this to be a truly random distribution, one must
average over both x and y as shown in Equation I.3.
$$\langle f \rangle = \frac{1}{N(N-1)} \sum_{x} \sum_{y \neq x} f(x,y) \qquad \text{(I.3)}$$
Now, for x=0,
x        0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
y        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
g(x,y)   1  2  2  3  3  3  3  4  4  4  4  4  4  4  4
For x=0, there will be in general $2^{z-1}$ values of y with g(x,y)=z.
Now pick some other random value of x, say x=4.
x        4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
y        0  1  2  3  5  6  7  8  9 10 11 12 13 14 15
g(x,y)   3  3  3  3  1  2  2  4  4  4  4  4  4  4  4
so, again, there are $2^{z-1}$ values of g(x,y). We thus have a basis
for a change of variables,
$$\sum_{y \neq x} f(x,y) = \sum_{z} M(z)\, f(z) \qquad \text{(I.4)}$$
where z=g(x,y) and $M(z) = 2^{z-1}$. Now f(z) just equals $(1/2)^z$; and
the limits of the summation are from 1 to n, where n is the number
of levels.
$$\sum_{z=1}^{n} 2^{z-1} \left(\tfrac{1}{2}\right)^{z} = \frac{n}{2} \qquad \text{(I.5)}$$
then
$$\langle f \rangle = \frac{1}{2^{n}-1} \cdot \frac{n}{2} \qquad \text{(I.6)}$$
$$\langle f \rangle = \frac{n}{2(2^{n}-1)} \qquad \text{(I.7)}$$
which depends only on n. For n=9, $\langle f \rangle$ = 9/1022.
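The average can be checked by brute force. In the sketch below the distance g(x,y) is taken to be the position of the highest bit in which x and y differ, an assumption that reproduces the two tabulated rows for x=0 and x=4; the closed form n/(2(2^n - 1)) is the value quoted for n=9.

```python
from fractions import Fraction

def g(x, y):
    """'Distance' of two inputs: the level on which they could block."""
    return (x ^ y).bit_length()

def mean_block_probability(n):
    """Average of (1/2)**g(x, y) over all ordered pairs of distinct inputs."""
    size = 2 ** n
    total = Fraction(0)
    for x in range(size):
        for y in range(size):
            if x != y:
                total += Fraction(1, 2 ** g(x, y))
    return total / (size * (size - 1))

def closed_form(n):
    """n / (2 (2**n - 1)), the value the text gives as 9/1022 for n = 9."""
    return Fraction(n, 2 * (2 ** n - 1))

print([g(0, y) for y in range(1, 16)])
print(mean_block_probability(4), closed_form(4))
print(closed_form(9))
```

The enumeration and the closed form agree for small n, and the closed form gives 9/1022 for n = 9.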
I.6 TERM ANALYSIS FOR RANDOM BLOCKING
Consider a packed Omega. The blocking probability for two inputs
(i,j) with random addresses is given by $f(i,j) = (1/2)^{g(i,j)}$.
Thus a matrix can be made out of the f(i,j)'s. In general, the
f(i,j)'s will either be 1/2, 1/4, 1/8, 1/16, or 0. In particular,
if any input k has no request, then both f(k,j)=0 and f(i,k)=0.
Also, f(i,i)=0.
Here notation will be changed so that it is more in line with
symbolic logic and set theory. Let aij = f(i,j) and not-aij = 1
- f(i,j). Now consider the prospects for adding one more input to
the net. Inputs can be added from left to right to see when it
becomes probable that the new input is blocked. The probability
that input 0 is blocked by 0, a00, is zero, of course. The first
term will be the probability that 1 is blocked by 0 (i.e. a10,
which equals 1/2 for a packed and 0 for an unpacked Omega). The
second term will be the probability that 2 is blocked by 0, but 1
is not (i.e. a20 x not-a10), which is 1/4 x 1/2 for a
packed Omega. The third term will be the probability that 1 blocks 2
given not-a20 and not-a10 (i.e. a21 x not-a20 x not-a10).
The next term will be
a30 x not-a21 x not-a20 x not-a10
and the term after that is
a31 x not-a30 x not-a21 x not-a20 x not-a10.
In general, the kth term will be the product of k such atomic
units, of which k-1 are negated. In the iterative procedure, one
would have a 'tail' to the end of which the negated form of the
last atomic unit is multiplied before multiplying by the new
atomic unit obtained from the matrix. The kth term is then summed
to the present value of the first k-1 terms.
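The iterative procedure with its 'tail' can be sketched for a packed Omega as below. The ordering of the atomic units (a10, a20, a21, a30, ...) follows the terms listed above; the distance function and all names are assumptions made for illustration, and the unpacked case is not modeled.

```python
from fractions import Fraction

def f(i, j):
    """Atomic unit a_ij for a packed Omega: (1/2)**g(i, j), with f(i, i) = 0."""
    if i == j:
        return Fraction(0)
    return Fraction(1, 2 ** (i ^ j).bit_length())

def running_terms(num_inputs):
    """Return (i, j, term, running_sum) for each atomic unit, in the order of the text."""
    tail = Fraction(1)     # product of the negated forms of all earlier units
    total = Fraction(0)
    rows = []
    for i in range(1, num_inputs):
        for j in range(i):
            a = f(i, j)
            term = a * tail       # new unit times the tail of negated units
            total += term
            tail *= 1 - a         # append the negated form of this unit
            rows.append((i, j, term, total))
    return rows

for i, j, term, total in running_terms(4):
    print(i, j, term, total, total > Fraction(1, 2))
```

The first two terms come out as 1/2 and 1/4 x 1/2 = 1/8, as stated in the text, so the running sum already exceeds 1/2 after the second term in the packed case.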
When in the course of this procedure the current value of this
summation becomes greater than 1/2, the procedure may be aban-
doned. The fact that the probability rises above 1/2 means that
it is expected that this new element will be blocked. A new
procedure is now adopted, in an attempt to find the probability
that any two elements are blocked. The first two terms are zero.
The third group of terms is [expression illegible in source]. The
fourth group of terms will be [expression illegible in source].
The fifth group will be [expression illegible in source]. In
general, for the kth term group, there will be k-1 subgroups, each
composed of k products. The k products will be given by the
permuting of one additional affirmative over the smallest k-1
matrix elements, on one side of the diagonal. The smallest elements
are defined by the fact that the left subscript must be less than
or equal to that of the new element, and the right subscript must
be less than it. If the sum of these terms at any time is greater
than 1/2, this procedure, too, is terminated and a procedure which
tests for three blocks is implemented.
In general, in a procedure looking for i blocks, the kth group of
terms will have $(k-1)!/[(i-1)!\,(k-i)!]$ subterms, each of which
is a permutation of i-1 affirmative a's over k-1 a's. When i is
large enough so that the whole network is done while the sum is
less than 1/2, then this is just the expected blocking rate. Since
there are on the order of .5n^2 groups of terms (for half the
atomic coefficients in the square array), with
$(k-1)!/[(i-1)!\,(k-i)!]$ subterms in each group, there are at
least [expression illegible in source] operations in this
procedure. For large problems there are many blocks, and i may be
on the order of 100, making the computation even more prohibitive
than that suggested in Section 1.4.
However, there may be a "coarser" way to estimate the network
blocking. We have noted that there are k atomic elements in each
of the $(k-1)!/[(i-1)!\,(k-i)!]$ subgroups. The minimum number of
elements in any subgroup is i, for i blockages. For the kth group,
each of these subgroups will be composed of i affirmative a's and
k-i negative a's. Now the average value of one of these a's is, as
shown in Section 1.5, n/2(2^n - 1) or log2N/2(N-1). Similarly, one
could show that the average value of the function 1-(1/2)g(x,y) is
1-(log2N/2(N-1)). (Assume that the average value of each of the
terms found in this way will be a good estimator for the product.
Basically this average says that the typical block will occur at
the log2(2(N-1)/n)th level. For n=9, this is about 6. In this
way, the computation can be drastically reduced.) Each of these
groups can be written as a product of one of these estimator terms
times the number of such terms. And since there are such groups
for all k from i to N, we are left with the sum
[equation illegible in source]
which depends only on i and N. The largest such term occurs for
k=i, and is just (1-(log2N/2(N-1)))^i. (Since E(i in N) is greater
than its first term, a good way to make a lower limit
approximation for i is to use the least i such that this term is
less than 1/2. For N=512, this is just (1013/1022)^i.) The last
and smallest term in the series, for k=N, which we call Emin(i in
N), may be written
[equation illegible in source]
Note the resemblance between the part in brackets and the formula
in Section 1.4, with N-i corresponding to x. One major difference
between the two is that the 'i' in the former is the number of
blocks, while the 'x' in the latter is the number of successes.
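The lower-limit estimate for i quoted above is easy to evaluate
numerically. The following is a minimal sketch under the text's own
assumption that every atomic unit may be replaced by its average
value n/2(2^n - 1), with N = 2^n; the function names are
hypothetical.

```python
def average_atomic_value(n):
    """Average value of one atomic unit, n / (2 * (2**n - 1)), N = 2**n."""
    return n / (2 * (2 ** n - 1))


def least_i_below_half(n):
    """Least i for which (1 - average)**i drops below 1/2."""
    keep = 1.0 - average_atomic_value(n)  # 1013/1022 when n = 9
    i = 1
    while keep ** i >= 0.5:
        i += 1
    return i


if __name__ == "__main__":
    print(average_atomic_value(9))   # 9/1022 for N = 512
    print(least_i_below_half(9))
```

For N = 512 the loop stops at i = 79, the first power of 1013/1022
that falls below 1/2.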
1.7 STATE OF THE CONNECTION NETWORK
One of the networks proposed is the two-layer unpacked Omega with
bit reversal and alternating priorities between layers and cycles.
In fact, it is suspected that a hardwired processor-to-input
randomization would work as well as a bit reversal. Any priority
rule that favors the left port will favor addresses going to the
left side of the network, and vice versa. However, a random
priority rule, where the priority is determined by a random number
from 1 to 4 (favoring left, right, straight-through, and crossed)
would probably be optimal. One way to improve the priority rules
is to add a bit to the address which says: "I am a success so
far." Then if there is a conflict, and if one or the other of the
addresses has, say, a 1 in this place, then the switch will give
that address the priority.
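The "success so far" bit can be illustrated with a small
arbitration routine. This is a sketch only: the request
representation (a dictionary with a boolean "success" field), the
function name resolve_conflict, and the use of a random choice to
break ties are assumptions made for illustration, not a description
of the simulated hardware.

```python
import random


def resolve_conflict(request_a, request_b, rng=random):
    """Choose which of two conflicting requests gets the switch output.

    Each request is a dict with a boolean 'success' entry meaning
    "I am a success so far". A request carrying the bit beats one
    that does not; otherwise the winner is drawn at random.
    """
    if request_a["success"] != request_b["success"]:
        return request_a if request_a["success"] else request_b
    return rng.choice([request_a, request_b])  # tie: random priority


if __name__ == "__main__":
    a = {"source": 0, "success": True}
    b = {"source": 1, "success": False}
    print(resolve_conflict(a, b)["source"])
```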
The Benes network now appears suboptimal. In the absence of
overall control, an algorithm must be developed which produces an
address from the first half of the network. The algorithm studied
obtained the address through an "exclusive or" on the processor
and memory module numbers. This algorithm has a serious flaw in
it for any unpacked Benes. As long as the ports are unpacked,
both processor and EM numbers are even--except for the nine
odd-valued memory module numbers. The fact that the least
significant bit is zero implies that the addresses go straight
through on the first level; this in turn implies that at the
middle level of the network all the addresses are in the left half
of the switches. Thus, at the middle layer, the addresses are
're-packed'. Also, at every level prior to the middle one, only
half the switches are used. Needless to say, this seriously
degrades the simulation of any unpacked Benes. However, this
problem should be rectifiable with a bit reversal.
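The flaw is easy to reproduce. The sketch below assumes a
1024-port network in which unpacked processors and memory modules
occupy the even-numbered ports; the function names are
hypothetical. It forms the first-half address by "exclusive or" and
checks its least significant bit.

```python
def first_half_address(processor, module):
    """Address for the first half of the network: processor XOR module."""
    return processor ^ module


def least_significant_bit(processor, module):
    """Bit that the text says steers the request at the first level."""
    return first_half_address(processor, module) & 1


if __name__ == "__main__":
    even_ports = range(0, 1024, 2)  # assumed unpacked port numbering
    bits = {least_significant_bit(p, m) for p in even_ports for m in even_ports}
    print(bits)                            # only 0 appears for even/even pairs
    print(least_significant_bit(4, 513))   # an odd-valued module gives 1
```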
Part of this analysis considered various alternative network
organizations in case additional throughput is required. One way
to improve performance is to 'double-unpack' the inputs, so that
there is only one address for every four ports. This implies, as
well, adding another level to the network. To compare the effect
of doubling the number of ports with that of adding another layer,
a packed two-layer Omega and a packed one-layer Omega with 256
requests were simulated. If the simulation is to scale properly,
one would expect to find half as many successes with 256 inputs to
512 ports as with 512 inputs to 1024 ports. In simulation, the
packed two-layered Omega with 256 inputs produced 70 successes.
Thus in performance a double-unpacking appears equivalent to
doubling the number of layers. This further suggests that for the
first cycle the success rate depends only on the number of input
ports for an Omega network.
Still probably the best way to reduce blockages experienced in the
networks studied is to add more layers. As was previously noted,
it is deceptive to think in terms of "Layers". Any n-layered
network of 2 x 2 switches can actually be represented as one layer
of 2n x 2n switches. When Benes proved his theorem, he did so for
any Benes network of square switches, i.e., n x n. Now for an
Omega Network, the address generates the path at the switch by a
simple procedure of left-right responses. And while it may be
difficult to construct a local control procedure for binary
addresses or odd-valued n x n switches, equivalences can be set up
between n inputs and n outputs for a 2n x 2n switch. The result-
ing logic at each switch would be more complicated in this case.
References
(1) Lawrie, Duncan Hamish, "Memory-Processor Connection Networks",
February 1973.
(2) Benes, V. E., "Heuristic Remarks and Mathematical Problems
Regarding the Theory of Connecting Systems", Bell System
Technical Journal, 41, (1962), pp. 1202-1247.
(3) Benes, V. E., "Permutation Groups, Complexes, and Rearrange-
able Connecting Network", B.S.T.J., 43, (July 1964), pp.
1619-1640.
(4) Benes, V. E., "A 'Thermodynamic' Theory of Traffic in Connect-
ing Networks", B.S.T.J., 42 (1963), pp. 567-607.
APPENDIX J
DESIGN GUIDELINES FOR NASF PROCESSING SYSTEM
J.1 SCOPE
This document delineates general guidelines that may be used in
the design, fabrication and assembly of the hardware required for
the Numerical Aerodynamic Simulation Facility (NASF) Processing
System.
J.2 DESIGN CONSTRAINTS
J.2.1 Environmental
The environmental limits specified represent the conditions
normally found in most laboratory or office buildings deemed
suitable for professional employees and assume that air
conditioning and other controls have been provided to attain these
levels. It is incumbent on the design of the FMP hardware not to
adversely affect this environment.
J.2.1.1 Atmospheric Conditions
Table J.1 defines the limits of temperature, humidity, and
altitude for operating, non-operating, storage and shipping
conditions.
Dust levels may exist to the extent resulting from a filtered air
conditioning system meeting NBS blackness test with a minimum
rating of 50% efficiency using atmospheric dust.
J.2.1.2 Mechanical Stress
Table J.2 delineates the mechanical stress levels for the
equipment installed (operating and non-operating) and in shipping
containers.
Shock is defined as a non-periodic mechanical pulse of large
amplitude about a fixed point.
Vibration is a steady state periodic or random oscillation which
may have a sinusoidal or a complex waveform and may have a single
frequency or broad spectrum.
J.2.1.3 Acoustic Noise
The equipment should not be affected by exposure to sound
pressures of 130 dB* (c) for a period of 30 minutes.
* Ref. 2 x 10^-4 dynes/cm^2
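To give a feel for the 130 dB figure, the level can be converted
back to a pressure. This is a minimal sketch using the usual
definition of sound pressure level, 20 log10(p/p_ref), with the
reference pressure of 2 x 10^-4 dynes/cm^2; the function name is
hypothetical.

```python
def sound_pressure(level_db, p_ref=2e-4):
    """Pressure in dynes/cm^2 for a sound pressure level in dB.

    p_ref is the reference pressure, 2 x 10^-4 dynes/cm^2.
    """
    return p_ref * 10 ** (level_db / 20.0)


if __name__ == "__main__":
    p = sound_pressure(130.0)
    print(p)          # dynes/cm^2
    print(p * 0.1)    # pascals (1 dyne/cm^2 = 0.1 Pa)
```

At 130 dB the pressure is roughly 632 dynes/cm^2, or about 63 Pa.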
TABLE J.1 ATMOSPHERIC CONDITIONS

                    OPERATING       NON-OPERATING    SHIPPING AND
                    (Installed)     (Installed)      STORAGE
DRY BULB TEMP.      18°C to 30°C    -40°C to 50°C    -40°C to 70°C
WET BULB TEMP.      NA              30°C Max.        40°C Max.
RELATIVE HUMIDITY   40% to 60%      90% Max.         95% Max.
ALTITUDE            0-3 km          0-3 km           0-15 km
TABLE J.2 MECHANICAL STRESS

                      Operating and        Shipping and Storage
                      Non-Operating        (In shipping container)
                      (Installed)
SHOCK
  Peak Acceleration   .5 g                 5 g
  Duration            .1 to 1 sec.         5 to 50 millisec.
  Waveshape           1/2 sine             1/2 sine
  Force Application   Horizontal           3 orthogonal axes
VIBRATION
  Frequency Range     5 to 500 Hz          5 to 500 Hz
  Peak Acceleration   .1 g                 1.5 g
  Force Application   3 orthogonal axes    3 orthogonal axes
J.2.1.4 Radiation
The equipment should not be affected by radiation of the following
intensities:
(1) Stray magnetic fields    .0005 tesla
(2) External RFI             3.0 Volt/Meter, 500 kHz to 10 GHz
J.2.1.5 Static Electricity
Externally exposed hardware should be immune to static electric
discharges of up to 10 kilovolts from 500 pF through 50 ohms.
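The stated discharge source (10 kilovolts, 500 pF, 50 ohms) can be
turned into more familiar figures. The sketch below assumes an
ideal capacitor discharging through an ideal resistor, which is a
simplification and not a test method taken from this guideline; the
function name is hypothetical.

```python
def discharge_figures(voltage_v, capacitance_f, resistance_ohm):
    """Stored energy, initial current and RC time constant of the source."""
    energy_j = 0.5 * capacitance_f * voltage_v ** 2   # E = C V^2 / 2
    peak_current_a = voltage_v / resistance_ohm        # I = V / R at t = 0
    time_constant_s = resistance_ohm * capacitance_f   # tau = R C
    return energy_j, peak_current_a, time_constant_s


if __name__ == "__main__":
    energy, current, tau = discharge_figures(10e3, 500e-12, 50.0)
    print(energy, current, tau)
```

Under these assumptions the source stores 25 millijoules, delivers
an initial current of 200 amperes, and decays with a time constant
of 25 nanoseconds.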
J.2.1.6 Fungus
Fungus inert parts and materials should be used to the greatest
extent possible. Parts or materials not inert to fungus growth
should be treated with fungicidal material. No damage to parts or
material should result from treatment with fungicidal material or
fungicidal coating.
J.2.2 Electromagnetic Interference Control
The NASF equipment should be compatible with: a) other electronic
devices operating in the immediate area, and b) communications
services. Control of the electromagnetic emanations from the NASF
equipment must be an integral part of overall system design.
Based on the nature of the NASF design and mission, conducted and
radiated emanations shall comply with the limits illustrated on
Figures J.l and J.2 respectively.
J.2.3 Acoustic Noise Control
Personnel should be provided an acoustical environment which will
not interfere with, or in any way degrade overall NASF
effectiveness. To ensure compliance with this requirement,
acoustic noise levels of the NASF Processing System should not
exceed the following criteria:

EQUIPMENT AREAS                75 dB (A) *    68 dB SIL **
OPERATOR AREAS                 65 dB (A)      58 dB SIL
I/O AREAS (ELECTROMECHANICAL   80 dB (A)      71 dB SIL
DEVICES)

*  dB(A):  Measurement using (A) weighting network on Sound Level
           Meter.
** dBSIL:  Speech Interference Level - The arithmetic average of
           the sound-pressure levels in the octave bands centered
           on 500, 1000, and 2000 Hz.
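The Speech Interference Level defined above is a plain arithmetic
average, so checking a measurement against the criteria is
straightforward. In this sketch the function names, the area keys
and the sample octave-band readings are hypothetical; only the
limit values come from the criteria above.

```python
# SIL limits in dB for each area, as listed in the criteria above.
SIL_LIMITS_DB = {"equipment": 68.0, "operator": 58.0, "io": 71.0}


def speech_interference_level(l_500, l_1000, l_2000):
    """Arithmetic average of the 500, 1000 and 2000 Hz octave-band levels."""
    return (l_500 + l_1000 + l_2000) / 3.0


def meets_sil_limit(area, l_500, l_1000, l_2000):
    """True when the SIL does not exceed the limit for the given area."""
    return speech_interference_level(l_500, l_1000, l_2000) <= SIL_LIMITS_DB[area]


if __name__ == "__main__":
    # Made-up readings for an operator area.
    print(speech_interference_level(60.0, 58.0, 56.0))
    print(meets_sil_limit("operator", 60.0, 58.0, 56.0))
```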
[Plots of conducted limit versus Frequency (MHz), with frequency
axis marks at .45, 1.6 and 30:
a) Narrowband Limit - Average Detector*
b) Broadband Limit - Quasipeak Detector**]
* AVERAGE DETECTOR: A detector, the output voltage of which
approximates the time average value of the envelope of an applied
signal. Refer to ANSI C63.2-197X
** QUASIPEAK DETECTOR: A detector having specified electrical
time constants, which, when regularly repeated pulses of constant
amplitude are applied to it, delivers an output voltage which is a
fraction of the peak value of the pulses, the fraction increasing
towards unity as the pulse repetition rate is increased. Refer to
CISPR Publications 1, 2, and 4.
Figure J.l Conducted Limits
[Plot of radiated limits versus Frequency - MHz, with frequency
axis marks at 30, 88, 216 and 1000; curves: Broadband Limit -
Quasipeak Detector and Narrowband Limit - Quasipeak Detector]
Figure J.2 Radiated Limits (30 Meters)
J.2.4
Tables J.3 and J.4 delineate the minimum quality level of the
power that should be available for the NASF Processing System. The
Processing System should have its own power control and
distribution subsystem that will operate from this input power and
supply the appropriate power to the various hardware elements of
the processing system.
J.2.5 Design and Construction
Unless otherwise specified, the NASF Hardware should be designed
in accordance with good commercial practices.
J.2.5.1 Physical Characteristics
J.2.5.1.1 Cabinets - Removable panels and doors should be
utilized to enclose the structure.
J.2.5.1.2 Size and Weight - No single unit, cabinet or component
should exceed 3,600 pounds or exceed the following dimensions:
Height 72"
Width 84"
Depth 35"
The floor loading should be no more than 250 lbs/ft^2 for fully
operable equipment.
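The weight, footprint and floor loading limits can be checked
against one another. The sketch below assumes the weight is spread
uniformly over the full width-by-depth footprint, which is a
simplification; the function name is hypothetical.

```python
def footprint_loading(weight_lb, width_in, depth_in):
    """Uniform floor loading in lbs/ft^2 for a rectangular footprint."""
    area_ft2 = (width_in * depth_in) / 144.0  # 144 square inches per square foot
    return weight_lb / area_ft2


if __name__ == "__main__":
    # Heaviest permitted unit on the largest permitted footprint.
    loading = footprint_loading(3600.0, 84.0, 35.0)
    print(loading, loading <= 250.0)
```

A 3,600 pound unit on the largest permitted footprint loads the
floor at about 176 lbs/ft^2, within the 250 lbs/ft^2 limit; a unit
of the same weight on a smaller footprint would need to be checked
separately.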
J.2.5.1.3 Marking
J.2.5.1.3.1 Marking of Equipment - Each major assembly should be
permanently and legibly marked with the manufacturer's
identification (name, initials, trademark, code number, or
symbol), serial number, and model number. Permission shall be
granted to the manufacturer to place its name/symbol on the front
of the equipment.
J.2.5.1.3.2 Marking of Controls - Controls related to the
operation or conditioning of the equipment, either remotely or
locally, should be clearly identified.
J.2.5.1.3.3 Marking of Subassemblies - All removable and
repairable plug-in subassemblies should be identified and marked
with a serial number. Labels should be positioned so they can be
readily seen.
Table J.3
Power Source Description

Total Power:                  750 KVA
Frequency:                    60 Hz
  Tolerance:                  ± 3%
  Rate of Change:             1.5 Hz/Sec Max
Voltage:                      480 V, 3 Phase, 3 Wire plus ground
  Range of slow-averaged      +10%, -15%
  rms voltage (including
  brown outs)
  Imbalance                   5% Max
  Modulation                  1% Max
  Harmonics (total)           20% Max
  Max Any Harmonic            10% Max
  Deviation Factor            25% Max
  D. C. Component             1% Max
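Two derived figures are useful when reading Table J.3: the line
current at full load and the absolute voltage window. The sketch
below uses the standard balanced three-phase relation
S = sqrt(3) x V x I, taking 480 as the line-to-line voltage; the
function names are hypothetical.

```python
import math


def line_current(total_kva, line_voltage):
    """Line current in amperes for a balanced three-phase load."""
    return total_kva * 1000.0 / (math.sqrt(3) * line_voltage)


def voltage_window(nominal, plus_pct, minus_pct):
    """Lowest and highest slow-averaged rms voltage for the given range."""
    return nominal * (1 - minus_pct / 100.0), nominal * (1 + plus_pct / 100.0)


if __name__ == "__main__":
    print(line_current(750.0, 480.0))        # full load at nominal voltage
    low, high = voltage_window(480.0, 10.0, 15.0)
    print(low, high)
    print(line_current(750.0, low))          # same load at the low end of the range
```

At nominal voltage the full 750 KVA corresponds to about 902
amperes per line, and the +10%, -15% range spans 408 to 528 volts.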
Table J.4
Power Source Transients, Recovery and Capability

                                 LEVEL

*Power Source Transients (on any or all phases)

1/2 Cycle or Longer
  Maximum Transient Surge        To 130% of nominal rms voltage,
                                 recovering to 120% in 50 ms or
                                 less, then within 110% in 3 sec.
                                 or less
  Maximum Transient Sag          To 50% of nominal rms voltage,
                                 recovering to 70% in 100 ms or
                                 less, then to 85% or more in
                                 0.5 sec. or less

Less Than 1/2 Cycle But More Than 100 Microseconds
  Maximum Overvoltage            150% of nominal peak voltage
                                 (212% of nominal rms voltage)
                                 provided that volt-second limit
                                 is not exceeded
  Maximum Overvoltage            150% of nominal volt-seconds
  Transient volt-seconds         provided the voltage limit above
                                 is not exceeded
  Maximum Transient sag peak     To zero volts
  Maximum Undervoltage           To zero for 1/2 cycle
  Transient volt-seconds

Maximum Voltage Deviation        400% of nominal peak voltage
of RFI, 10 kHz or greater        (566% of nominal rms voltage)

Event Rate                       Maximum of 10 in 10 minutes and
                                 at least 6 seconds between
                                 maximum limit events and full
                                 recovery to specified range of
                                 rms voltage between events

Power Source Capability
  Peak Inrush Limit              4 kVA or 1 to 8 x rated kVA of
                                 load
  Load Imbalance                 25% max or 10 kVA, whichever is
                                 greater
  Source Impedance               [illegible] to 5% of rated "base"
                                 ohms at the power frequency
  Ground Return Impedance        Impedance low enough to create
                                 ground fault current of 10 x
                                 breaker trip rating for ground
                                 fault

REMARKS
* Voltage deviations shall be within the limits shown when the
  utilization voltage is within its tolerance limits (+10%, -15%).
Surge component only would be 250% total if occurring at wave
peak.
Composite wave form.
Impulse component only would be 500% if occurring at wave peak.
Starting condition for all events shall be within specified range
of rms voltage and may be at worst case to maximize the event.
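Table J.4 quotes several transient limits both as a percentage of
nominal peak voltage and as a percentage of nominal rms voltage.
For a sinusoidal supply the two differ by a factor of the square
root of 2, which the short sketch below reproduces; the function
name is hypothetical.

```python
import math


def peak_pct_to_rms_pct(pct_of_nominal_peak):
    """Express a percentage of nominal peak voltage as a percentage of
    nominal rms voltage, for a sinusoidal supply (peak = sqrt(2) x rms)."""
    return pct_of_nominal_peak * math.sqrt(2)


if __name__ == "__main__":
    print(round(peak_pct_to_rms_pct(150.0)))  # overvoltage limit
    print(round(peak_pct_to_rms_pct(400.0)))  # RFI voltage deviation limit
```

The rounded results, 212 and 566, agree with the parenthetical rms
figures in the table.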
J.2.5.1.4 Accessibility - All adjustments and any other work
required after the system is assembled should be readily
accessible for servicing. Special tools or other mechanical
devices required to readily and accurately adjust the equipment
should be supplied as part of the unit. The equipment should be
designed to protect operating and maintenance personnel from
contact with hazardous devices. The equipment may, however, be
designed to operate with the frame covers removed (but not
necessarily meet acoustic and EMI requirements under these
conditions).
J.2.5.1.5 Grounding - Circuit grounds should be isolated from
chassis grounds but may be connected to the chassis if required.
Ground connections to chassis should be mechanically secured by
soldered terminals and locked by means of a lockwasher and nut. A
chassis ground tiepoint should be provided at the power interface
of individual cabinets and major elements of the system.
J.2.5.1.6 Mechanical Operation - All controls should operate
freely and smoothly without binding, scraping, or cutting. Play
and backlash should be minimized and should not cause poor contact
or inaccurate setting.
J.2.5.1.7 Transportability - The elements of the NASF Computing
System should be transportable by qualified domestic common car-
rier without damage or deterioration when packaged, preserved, and
prepared for shipment.
J.2.5.2 Materials, Processes, and Parts
J.2.5.2.1 Parts Selection - The following principles should be
utilized in parts selection:
(a) The variety and types of parts required should be kept
to a minimum.
(b) Common and regularly stocked parts should be used
whenever feasible to simplify maintenance, storage and
supply.
(c) The use of proprietary components should be avoided where
practicable.
J.2.5.3 Workmanship - The equipment should be processed in such a
manner as to be uniform in quality and free from defects that will
adversely affect life, serviceability, and appearance. All metal
surfaces should be clean and free from burrs, roughness, oxide,
scales, and sharp edges. Printed circuit boards should be free
from cold soldering, corrosion, salts, smut, grease, finger
prints, flux residue, and foreign materials.
J.2.6 Product Safety
The NASF equipment should be designed and constructed so that in
normal operation and maintenance the equipment will function
reliably without causing injuries to persons or damage to
property, considering possible careless use that may occur in
normal service. Specifically, the equipment should be designed to
comply with the requirements of Underwriters Laboratories (UL)
Standard for Safety, Data Processing Units and Systems, UL 478.
J.2.7 Service Life
The intended life of the system as a system is ten years. Service
life is the anticipated life of the system, as a system, without
reference to the anticipated useful life of the parts of the
system and, therefore, assumes that necessary maintenance and
repairs will be performed as required.
