Digital system for structural dynamics simulation by Glor, C. et al.
NASA-CR-168019 
19830005868 
NASA CR-168019 
MTI83TR12 
DIGITAL SYSTEM FOR 
STRUCTURAL DYNAMICS SIMULATION 
by A.I. Krauter, L.J. Lagace, M.K. Wojnar and C. Glor
SHAKER RESEARCH CORPORATION 
prepared for 
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION 
NASA Lewis Research Center 
Contract NAS 3-22546 
https://ntrs.nasa.gov/search.jsp?R=19830005868 2020-03-21T05:16:33+00:00Z
1. Report No.: NASA CR 168019
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: Digital System for Structural Dynamics Simulation
5. Report Date: November, 1982
6. Performing Organization Code:
7. Author(s): A. I. Krauter, L. J. Lagace, M. K. Wojnar, C. Glor
8. Performing Organization Report No.: 83TR12
9. Performing Organization Name and Address: Shaker Research Corp.,
   968 Albany-Shaker Road, Latham, New York 12110
10. Work Unit No.:
11. Contract or Grant No.: NAS3-22546
12. Sponsoring Agency Name and Address: NASA Lewis Research Center,
    21000 Brookpark Rd., Cleveland, OH 44135
13. Type of Report and Period Covered: Contractor Report
14. Sponsoring Agency Code:
15. Supplementary Notes: Final Report. Project Manager: L. J. Kiraly,
    MS 23-2, NASA Lewis Research Center
16. Abstract
The objective of this program was to develop and document the design of a digital device for
structural dynamics simulation. The simulator design has incorporated state-of-the-art digital
hardware and software for the simulation of complex structural dynamic interactions, such as
those which occur in rotating structures. The targeted uses of this system include simulations
and parametric design studies to identify improved design criteria and methodology, to identify
structural dynamics instabilities and to evaluate the effects of local non-linearities, transient
loadings, and engine control instabilities.
The system has been designed to use an array of processors wherein the computation for each
physical subelement or functional subsystem would be assigned to a single specific processor in
the simulator. These node processors are custom designed microprogrammed bit-slice microcomputers
which function autonomously and can communicate with each other and a central control minicomputer
over parallel digital lines. Inter-processor nearest neighbor communications busses pass the
constants which represent physical constraints and boundary conditions. Each node processor has
its own program and data memory. Each node processor calculates its results independently and
simultaneously with the other node processors. The node processors are connected to the six
nearest neighbor node processors to simulate the actual physical interface of real substructures.
Computer generated finite element mesh and force models can be developed with the aid of the
central control minicomputer. The program so developed is converted to the proper format,
segmented and loaded into the individual processors which make up the simulator. The control
computer also oversees the animation of a graphics display system and disk-based mass storage,
along with the individual processing elements.
The mathematical approach to simulating the dynamic behavior of an engine system is based upon the
explicit time integration of the state vectors assigned to the individual processors. An
integration technique such as fourth order Runge-Kutta is applicable to this analysis.
17. Key Words (Suggested by Author(s)): Simulation, High Speed Rotating System
    Simulation, Rotating Structures Simulation, Parallel Processing Methods,
    Digital Processor Arrays
18. Distribution Statement: Unclassified, Unlimited
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages: 125
22. Price*:
* For sale by the National Technical Information Service, Springfield, Virginia 22161
NASA-C-168 (Rev. 10-75)
TABLE OF CONTENTS
SUMMARY
INTRODUCTION
ANALYSIS - REQUIREMENTS FOR SIMULATION
  SAMPLE PROBLEM 1 - TOMKO PROBLEM
  SAMPLE PROBLEM 2 - DETERMINATION OF PROCESSOR
    COMPUTATIONAL REQUIREMENTS
  SOFTWARE REQUIREMENTS
  HARDWARE REQUIREMENTS
OVERVIEW OF SYSTEM ARCHITECTURE-SOFTWARE
  CONTROLLER OPERATING SYSTEM SOFTWARE
  OFFLINE MODEL DEVELOPMENT SOFTWARE
  REALTIME MODEL EXECUTION SOFTWARE
  NODE PROCESSOR OPERATIONAL PROGRAM
  NODE PROCESSOR MICROCODE
  DIAGNOSTIC SOFTWARE
NODE PROCESSOR ARCHITECTURE
  MEMORY
  DATA TYPES
  NODE PROCESSOR REGISTERS
  PROCESSOR STATUS WORD
  INPUT/OUTPUT
  PROCESSOR TRAPS
  INSTRUCTION FORMATS
  NODE PROCESSOR INSTRUCTIONS
SOFTWARE SPECIFICATION SUMMARY
SOFTWARE ASSESSMENT
NODE PROCESSOR HARDWARE
  MICROPROGRAM CONTROLLER
  REGISTERED ARITHMETIC LOGIC UNITS
  DYNAMIC MEMORY
  NODE PROCESSOR COMMUNICATIONS
  FLOATING POINT BUS INTERFACE AND SCRATCH PAD
  FLOATING POINT MULTIPLIER
  FLOATING POINT ADDER/SUBTRACTOR/DIVIDER
  HARDWARE ASSESSMENT
DISCUSSION OF RESULTS
SUMMARY OF RESULTS AND RECOMMENDATIONS
APPENDIX I - SUMMARY OF CURRENT SIMILAR SIMULATION PROGRAMS
APPENDIX II - SUMMARY OF RELEVANT PAPERS DESCRIBING PARALLEL
  PROCESSING SIMULATION PROCEDURES
LIST OF ILLUSTRATIONS
1. Sample Problem 1 Physical Model
2. 5 x 5 x 5 Node Processor Array
3. To/From Six Nearest Neighbor Block Diagram
4. System Block Diagram
5. Global Bus Interface Block Diagram
6. Node Processor Block Diagram
7. Microprogram Controller Block Diagram
8. RALU Block Diagram
9. Dynamic Memory Block Diagram
10. FIFO Buffered Communications Interface Block Diagram
11. Six Way Communications Interface
12. Six Way Communications Controller, Output Loop
13. Output Operation of the Six Way Communications Controller
14. Six Way Communications Controller Flowchart, Input Loop
15. Input Operation of the Six Way Communications Controller
16. Floating Point Bus Interface Block Diagram
17. Floating Point Multiplier Flowchart
18. Floating Point Multiplier Block Diagram
19. Floating Point Adder/Subtractor/Divider Block Diagram
20. Floating Point Addition Flowchart
21. Floating Point Subtraction Flowchart
22. Floating Point Division Flowchart
SUMMARY 
The objective of this program, conducted by Shaker Research Corporation, 
was to develop and document the design of a digital device for structural 
dynamics simulation. The simulator design has incorporated state-of-the-
art digital hardware and software for the simulation of complex structural 
dynamic interactions, such as those which occur in rotating structures. 
The targeted uses of this system include simulations and parametric design
studies to identify improved design criteria and methodology, to identify
structural dynamics instabilities and to evaluate the effects of local
non-linearities, transient loadings, and engine control instabilities.
The system has been designed to use an array of processors wherein the 
computation for each physical subelement or functional subsystem would be 
assigned to a single specific processor in the simulator. These node
processors are custom designed microprogrammed bit-slice microcomputers
which function autonomously and can communicate with each other and a
central control minicomputer over parallel digital lines. Inter-processor
nearest neighbor communications busses pass the constants which represent
physical constraints and boundary conditions. Each node processor has its
own program and data memory. Each node processor calculates its results 
independently and simultaneously with the other node processors. The node 
processors are connected to the six nearest neighbor node processors to 
simulate the actual physical interface of real substructures. 
Computer generated finite element mesh and force models can be developed 
with the aid of the central control minicomputer. The program so developed 
is converted to the proper format, segmented and loaded into the individual
processors which make up the simulator. The control computer also oversees
the animation of a graphics display system and disk-based mass storage,
along with the individual processing elements.
The mathematical approach to simulating the dynamic behavior of an engine
system is based upon the explicit time integration of the state vectors
assigned to the individual processors. An integration technique such as
fourth order Runge-Kutta is applicable to this analysis.
The hardware was designed as an array of 125 processors in a cubic structure.
Each node processor was designed for very high speed multiplication and
addition, which are fundamental requirements for the time-step integration
algorithms. This implementation of an array of processors operating in
parallel has the capability of solving simulation problems an order of
magnitude faster than a conventional serial computer.
INTRODUCTION 
This program was initiated to investigate and document the design of a 
Digital System for Structural Dynamics Simulation. The intended use for
the system is as a design and analysis aid for the production of gas turbine
engines. This simulator would save both money and time by reducing the
need for extensive prototyping during the development of gas turbine
engine hardware. While simulation methods exist for
use on main frame computers, this system uses the architecture of an array 
of processors for greater throughput. 
Current and on-going work in the field of parallel processing and simulation
was reviewed and is summarized in Appendices I and II.
Special acknowledgement is given to Mr. L. James Kiraly, Project Manager 
of the Digital System for Structural Dynamics Simulation at NASA-Lewis 
Research Center. Mr. Kiraly's contributions were pertinent to the success 
of this effort. His ideas are reflected throughout this final report. 
The scope of this work included the study of simulation techniques useful
with parallel processors, the design of the system architecture, the
hardware design of the individual processors in the array (node processors),
and the design and flowcharting of the processor instructions. Special
emphasis was placed upon the design of the system architecture and the
node processor hardware. Particular attention was paid to the high risk
areas of the node processor design. For instance, floating point hardware
for multiplication, addition, subtraction and division was designed through
detailed schematic diagrams. The CPU and communications controller, being
more conventional, were detailed only to the block diagram stage. The
node processor memory was designed to the schematic level.
Software design centered on the node processor. A very large and powerful
custom instruction set tied intimately to simulation techniques was
developed. The concepts necessary for on line and off line software were
started, but no attempt was made to write this software.
The concepts of simulation with regard to an array-of-processors solution
were developed in broad terms. Segmentation of a problem, problem size
and computational requirements were carried to the level sufficient for
node processor definition.
For a successful implementation of this system, follow-on is needed in
all three of the above areas. A sample problem must be developed and
programmed on a main frame computer. Further hardware detailing is
necessary to complete the node processor design. A breadboard version of a
node processor should be constructed, microcoded, and programmed with
the sample problem. Software development is needed to reach this
prototype stage. Additional software development is then required to
implement the entire array of processors.
The following section describes the simulation analysis leading to the
system architecture and node processor architecture.
ANALYSIS - REQUIREMENTS FOR SIMULATION 
In this section the groundwork for the Digital System for Structural Dynamics
Simulation is developed. Two sample problems are discussed. The first
problem describes a simplified nonlinear simulation problem and its solution
on an array of processors. The second problem is linear and is used to set
the bounds on the hardware by limiting the overall size of the physical model.
Some additional detail on problem substructuring is brought out. A time
step integration method appropriate for the solution is developed. Finally,
the hardware and software requirements are outlined prior to their detailed
discussion in the following two sections.
SAMPLE PROBLEM 1 
This sample problem, based on work by J. J. Tomko, is used to illustrate
techniques that can be used to solve a simulation problem on an array of
processors with time step integration methods. Models and techniques
other than the ones presented here can also be employed. The
model is shown in Figure 1. The model contains m disks, including shaft
segments which have mass. Each disk may contain n blades and there are b
bearings.
Disk Representation 
Each disk is represented by five coordinates. These coordinates are two
translations and three rotations. Each disk is taken as rigid. The disk
equation is of the form:
$$[M_{di}]\{\ddot{q}_{di}\} = f\left(t, \{q_{di}\}, \{\dot{q}_{di}\}, \{q_{d,i-1}\}, \{\dot{q}_{d,i-1}\}, \{q_{d,i+1}\}, \{\dot{q}_{d,i+1}\}, \{q_{bij}\}, \{\dot{q}_{bij}\}, \{q_{ci}\}, \{\dot{q}_{ci}\}\right) \tag{1}$$
where
$\{q_{di}\}$ is the five element state vector for disk i,
$\{q_{d,i-1}\}$ is the five element state vector for disk i-1,
$\{q_{d,i+1}\}$ is the five element state vector for disk i+1,
$\{q_{bij}\}$ is the state vector for blade j on disk i. This vector
can contain 4 elements.
$\{q_{ci}\}$ is the state vector for the casing at disk i. This
vector can contain 4 elements at a bearing or 80 elements
at a bladed disk.
$[M_{di}]$ can be non-diagonal and time variable. The non-diagonal
character arises from the use of a point other than the
mass center for the definition of the translational
degrees of freedom. The time variation arises from this
and from the use of fixed Cartesian coordinates.
[Figure 1. Sample Problem 1 physical model: model containing m disks
(including shaft segments having mass). Each disk can contain n blades.
There are b bearings. Model based on that of Tomko, J.J.]
Blade Representation 
Each blade is represented by two lumped masses having two degrees of freedom
each. These degrees of freedom are axial and circumferential translation.
The equation for each blade is of the form:
$$[M_{bij}]\{\ddot{q}_{bij}\} = f\left(t, \{q_{bij}\}, \{\dot{q}_{bij}\}, \{q_{di}\}, \{\dot{q}_{di}\}, \{q_{ci}\}, \{\dot{q}_{ci}\}\right) \tag{2}$$
where
$\{q_{bij}\}$ is the 4 element blade state vector, and where
$\{q_{ci}\}$ is the state vector for the casing at disk i. This vector
can contain, say, 40 radial and 40 axial coordinates for
a casing cross-section at a bladed disk.
The matrix $[M_{bij}]$ is assumed to be diagonal and constant; the equations
for the blades must be written using inertial coordinates.
Casing Representation 
Assume the casing has been modeled by a large scale finite element program
which includes, say, 40 nodes at each disk. Each of these nodes can have
radial and axial displacement. At bearing stations, the finite element
program provides at least two displacements and two rotations.
Assume also that the finite element program produces up to 10 modes. For
each mode, the mode vector is $\{u_r\}$, where this vector contains all of the
points used in the finite element analysis. The vector $\{u_{ri}\}$ is that part
of the modal motion occurring at cross-section i.
At disk i (i.e., cross-section i), the casing coordinates are given by
$$\{q_{ci}\} = \sum_{r=1}^{10} \{u_{ri}\}\, c_r(t)$$
where $c_r(t)$ is the modal motion of mode r.
The equation of motion for mode r is
$$\ddot{c}_r + 2\eta\omega_r\dot{c}_r + \omega_r^2 c_r = \sum_{i=1}^{m} \frac{\{u_{ri}\}^T \{f(t)_i\}}{\{u_r\}^T [M_c] \{u_r\}} \tag{3}$$
where $\eta$ is a user-supplied modal damping factor, $\omega_r$ is the modal frequency,
$\{f(t)_i\}$ is the force of the blades or of the bearing on the casing at
station i, and $[M_c]$ is the mass matrix for the finite element model. Note
that $\{u_r\}^T [M_c] \{u_r\}$ is a single scalar number for each mode and need be
computed only once.
At each disk cross-section $\{u_{ri}\}$ and $\{f(t)_i\}$ can contain 80 elements.
Consequently $\{q_{ci}\}$ can also contain 80 elements.
The force $\{f(t)_i\}$ for blade j is, in general, represented by
$$\{f(t)_i\} = f_i\left(\{q_{ci}\}, \{\dot{q}_{ci}\}, \{q_{bij}\}, \{\dot{q}_{bij}\}\right)$$
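The modal bookkeeping above can be illustrated with a minimal sketch. It is not taken from the report: the function names, the list-of-lists data layout and the tiny example values are assumptions chosen only to show how the modal mass is computed once per mode, how the right side of equation (3) is assembled from station contributions, and how casing coordinates are recovered from the modal motions.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def modal_mass(u_r, M_c):
    """Scalar u_r^T [M_c] u_r; computed once per mode (hypothetical helper)."""
    return dot(u_r, [dot(row, u_r) for row in M_c])

def modal_force(u_ri_by_station, f_by_station, mass_r):
    """Right side of equation (3): sum over stations of u_ri^T f_i / modal mass."""
    total = sum(dot(u, f) for u, f in zip(u_ri_by_station, f_by_station))
    return total / mass_r

def casing_coordinates(u_ri_by_mode, c):
    """q_ci = sum over modes r of u_ri * c_r(t) at one cross-section."""
    n = len(u_ri_by_mode[0])
    return [sum(u[k] * c_r for u, c_r in zip(u_ri_by_mode, c)) for k in range(n)]

# Illustrative two-point model with one mode and two stations (made-up numbers).
mass = modal_mass([1.0, 2.0], [[2.0, 0.0], [0.0, 1.0]])
force = modal_force([[1.0], [2.0]], [[3.0], [1.5]], mass)
coords = casing_coordinates([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.25])
print(mass, force, coords)
```

In the processor assignment described next, the per-station sums would be formed by the individual casing processors and the division by the modal mass by the first casing processor; the sketch collapses that into one function for brevity.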
A Parallel Processing Approach 
One approach to simulate the motion of the engine (i.e., to solve the dynamic
equations of motion) is as follows (in this, each box denotes a processor
and each line denotes transmission of information between processors). The
controlling minicomputer is not shown.
[Diagram: Typical Node Processor Assignment]
The array of processor nodes is assigned such that
Bi denotes a processor that treats all the blades on a disk,
Di denotes a processor that treats a disk or a lumped mass
segment of the shaft, and
Ci denotes a processor that treats a portion of the casing
(i.e., that portion of the casing associated with a disk
or with a lumped mass shaft segment).
The first casing processor has tasks beyond those of casing processors
C2 through Cm.
The functions of each processor are as follows: 
PROCESSORS D1 through Di+1
In general, $[M_{di}]$ must be evaluated at each time step and then inverted.
From the Bi, get averages of $\{q_{bij}\}$ and $\{\dot{q}_{bij}\}$,
where j = 1, ..., $N_i$. (For bladed disk only.) Five
averages are received.
From the casing processor, get $\{q_{ci}\}$ and $\{\dot{q}_{ci}\}$. (For bearings, all
quantities may be needed to solve equation (1). For
a bladed disk, only the radial displacements
will be necessary. The forces of the blades
on the disk in the radial direction can be
obtained by curve fitting the best ellipse to the
radial casing displacements and then evaluating
blade interference via a tip circle approach.)
Equation (1) may then be solved.
PROCESSORS B1 through Bi+1
For each blade $[M_{bij}]$ can be inverted once and stored if, as assumed, this
matrix is constant (time-independent).
From disk, get $\{q_{di}\}$ and $\{\dot{q}_{di}\}$. (These are needed for the blade force
calculation. The radial velocity of the disk is necessary to compute the
Coriolis component of circumferential blade acceleration.)
From casing, get velocities and positions of curve-fitted averages of
casing points (i.e., best ellipse in plane of cross-section and in an
axial plane, etc.). Processor must then determine which blades are rubbing
and what the circumferential and axial forces on those blades are.
Equation (2) may then be solved.
PROCESSORS C1 through Cm
These processors do not solve modal state equation (3). (Only processor
C1 solves this state equation, see below.)
From blades (if present at disk i), get best tip ellipse in plane of
cross-section and in an axial plane. (This tip circle is used for part of the
computation of the forces of the blades on the casing.)
From disk, get $\{q_{di}\}$ and $\{\dot{q}_{di}\}$. (If disk with blades is located at
cross-section i, the radial displacement of the disk is used for the remaining
part of the computation of the forces of the blades on the casing.)
From C1, get $c_r(t)$, r = 1, ..., M where M is the total number of casing
modes (say 10). The processor Ci contains mode shape information for this
(the ith) cross-section. Therefore $\{q_{ci}\}$ and $\{\dot{q}_{ci}\}$ can be determined from
$c_r(t)$.
The processor Ci determines force of either shaft on casing (using $\{q_{di}\}$
and $\{\dot{q}_{di}\}$) or of blades on casing (using these and $\{q_{bij}\}$ and $\{\dot{q}_{bij}\}$). For
the latter, blade forces on casing are computed via the tip circle approach
and via comparison of the best tip ellipses with the casing displacements.
The forces are assigned to nearest casing nodes (nodes in the cross-section
and circumferentially nearest to rubbing points). The processor Ci then
determines contribution of this cross-section to the right side of equation
(3).
Processor C1 solves equation (3). Contributions to right side of (3) are
received from processors Ci, i = 2, ..., m. The total for each mode is
assembled and then each modal equation (3) is solved.
Number of Path Variables**
PATH 1  Assume rotor modeled as a beam with torsion. Have two displacements
        and two slopes plus torsion. Result is 5 x 2* = 10 scalars.
PATH 2  Into Di - 5 average displacements. Result is 5 x 2* = 10 scalars.
PATH 3  Into Bi - best ellipses (4 coordinates for each plane). Result
        is 8 x 2* = 16 scalars.
        Into Ci - best ellipses (4 coordinates for each plane). Result
        is 8 x 2* = 16 scalars.
PATH 4  Into Ci - say 4 degrees of freedom for bearing/casing interaction.
        Result is 4 x 2* = 8 scalars.
PATH 5  Into Ci - say 10 modes. Result is 10 x 2* = 20 scalars.
        From Ci - one contribution for each mode. Result is 10 x 2* =
        20 scalars.
*  Factor 2 is for velocities.
** Time not included. Time is tracked by each processor. Timing
   is controlled by controlling minicomputer (not shown).
   Volume of data transmitted along a path is generally equal in
   both directions.
Simulation Procedures 
Several methods are available for simulation. These
choices fall into two categories, explicit methods and implicit methods.
Explicit methods require equations of motion in first order form. Velocities
and accelerations are obtained at each time step and then integrated. The
integration is separate from the equations of motion. These methods are
relatively simple to implement. Two explicit methods are the Runge-Kutta
and the Predictor-Corrector method.
In the Runge-Kutta method, more solutions of equations of motion are necessary
at each time step for a given accuracy. There is no inherent measurement of
error. No history of the problem is required. The method is self starting and
relatively simple, and time steps are easy to change.
In the Predictor-Corrector method, fewer solutions of equations of motion
at each time step for a given measure of accuracy are necessary. There is
an inherent measure of error. A history of the problem is required. The method
is not self starting, is relatively complicated, and time steps are relatively
hard to change.
Implicit methods require equations of motion in second order form. Velocities
and accelerations are represented by finite differences so that
displacements are obtained directly at each time step. The integration procedure
is closely coupled with the equations of motion. These methods are
relatively complicated.
The Runge-Kutta method has been chosen as the most suitable for the
structural dynamics simulator because it is self starting, does not require a
history and is relatively simple to program.
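As an illustration of why the method is self starting and needs no history, the following is a minimal sketch of one classical fourth order Runge-Kutta step applied to equations of motion in first order form. It is an assumption-laden example, not the node processor microcode: the function names, the single degree of freedom test oscillator and the step size are all chosen here for illustration.

```python
import math

def rk4_step(f, t, z, dt):
    """Advance the state list z from t to t + dt with classical 4th-order RK."""
    k1 = f(t, z)
    k2 = f(t + dt / 2, [zi + dt / 2 * ki for zi, ki in zip(z, k1)])
    k3 = f(t + dt / 2, [zi + dt / 2 * ki for zi, ki in zip(z, k2)])
    k4 = f(t + dt, [zi + dt * ki for zi, ki in zip(z, k3)])
    return [zi + dt / 6 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]

def oscillator(m, c, k):
    """First-order form of m*q'' + c*q' + k*q = 0 with state z = [q, q']."""
    def f(t, z):
        q, v = z
        return [v, -(c * v + k * q) / m]
    return f

def simulate(f, z0, dt, n_steps):
    """Integrate from t = 0 using only the current state (no history kept)."""
    t, z = 0.0, list(z0)
    for _ in range(n_steps):
        z = rk4_step(f, t, z, dt)
        t += dt
    return t, z

# Undamped unit oscillator released from q = 1; exact solution is cos(t).
t_end, z_end = simulate(oscillator(1.0, 0.0, 1.0), [1.0, 0.0], 0.01, 100)
print(t_end, z_end[0], math.cos(t_end))
```

Each step calls the equations of motion four times, which is the "more solutions of equations of motion" cost noted above, and the step size can be changed between calls without any restart procedure.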
SAMPLE PROBLEM 2 - DETERMINATION OF PROCESSOR COMPUTATIONAL CAPABILITIES 
This sample problem is linear and is developed to determine the functions,
the memory, communications, instructions and speed desired of the parallel
processing system.
Largest Linear Problem
The physical problem discussed will be the largest linear problem that the
system is designed to accept. Computational requirements for the linear
problem can be assessed in a straightforward manner. The extent and nature of
nonlinearities are not known, so their computational requirements cannot
be established. However, computational capabilities for typical nonlinear
calculations will be available since the actual associated linear problem
will normally be much smaller than this largest linear problem. To allow
for nonlinear computations, the following functions, at a minimum, should be
available:
$\sin x$, $\tan^{-1} x$, $\sqrt{x}$.
Less important, but also desirable, are the functions $a^x$ and $\log_p x$, where a
and x are real numbers and where p is either 10 or e.
The overall equation for the entire linear physical problem being simulated
can be written in the form
$$[M]\{\ddot{q}\} + [C]\{\dot{q}\} + [K]\{q\} = \{f(t)\} \tag{4}$$
where {q} is a vector containing N elements. The matrices [M], [C], and
[K] are the mass, damping, and stiffness matrices, respectively. The
vector {f(t)} contains N elements whose values, at any time, can be computed
directly.
Time Dependence of Matrices
The matrices [M], [C], and [K] can contain elements whose values are functions
of time.
Elements of [M] which vary with time can arise for certain choices of
coordinate system and of the associated coordinates {q}. Time variation
in elements of [C] and [K] arises when problems containing parametric
excitation are considered.
Banding of Matrices 
The matrices [C] and [K] can be formulated such that they are banded.
Structural stiffness matrices are banded because a generalized
displacement applied at a node produces forces only at
that node and at its neighbors. (Proper node sequencing is required.)
Diagonal Nature of Mass Matrix 
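The matrix [M] is locally diagonal; i.e., the matrix has the form
$$[M] = \begin{bmatrix} [M_1] & 0 & \cdots & 0 \\ 0 & [M_2] & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \ddots \end{bmatrix}$$
where only the diagonal blocks (the shaded regions) have non-zero elements.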
This assumption is critical in that it provides for mass decoupling of
regions of the problem. To produce such a mass matrix, each of these
regions must be connected to the other regions only by members having
stiffness and damping properties but no mass. In addition, the
generalized coordinates used for each region of the problem must not reference
coordinates in other regions of the problem.
With the above properties, Equation (4) can be put in the form
$$\operatorname{diag}\left([M_1], [M_2], \ldots\right)\{\ddot{q}\} + [C]\{\dot{q}\} + [K]\{q\} = \{f(t)\} \tag{5}$$
in which the first matrix is locally diagonal, with non-zero elements only
in its shaded regions, and the second and third matrices are banded,
where the bandwidth of the second and third matrices is greater than the
rank of the largest locally diagonal sub-matrix of the first matrix. For
the ith region, its equation of motion is that associated with the ith
shaded region in the first matrix. That equation of motion is
$$[M_i]\{\ddot{q}_i\} + [C_i]\{\dot{q}_i\} + [C_n]\{\dot{q}_n\} + [K_i]\{q_i\} + [K_n]\{q_n\} = \{f_i(t)\} \tag{6}$$
where $\{q_i\}$ is the vector for the coordinates of the ith region, and where
$\{q_n\}$ is the vector for the coordinates of the regions which are neighbors
to the ith region. The matrix $[M_i]$ is that for the ith shaded region of
the first (the mass) matrix. The corresponding portions of the second
(the damping) and the third (the stiffness) matrices are denoted as $[C_i]$
and $[K_i]$, respectively, and couple the equations for the ith region to
those for the neighboring regions. It is noted that matrices $[M_i]$, $[C_i]$,
and $[K_i]$ are square while the matrices $[C_n]$ and $[K_n]$ are not square but
have the same number of rows as do the i matrices. Also, the vector
$\{f_i(t)\}$ has the same number of rows as do the i matrices.
Equation (6) can be put in first order form by defining "state" variables
{Z} such that
$$\{Z_i\} = \begin{Bmatrix} \{q_i\} \\ \{\dot{q}_i\} \end{Bmatrix}, \qquad \{Z_n\} = \begin{Bmatrix} \{q_n\} \\ \{\dot{q}_n\} \end{Bmatrix}$$
so that (6) becomes
$$\begin{bmatrix} I & 0 \\ 0 & M_i \end{bmatrix} \{\dot{Z}_i\} = \begin{Bmatrix} 0 \\ f_i(t) \end{Bmatrix} - \begin{bmatrix} 0 & -I & 0 & 0 \\ K_i & C_i & K_n & C_n \end{bmatrix} \begin{Bmatrix} Z_i \\ Z_n \end{Bmatrix} \tag{7}$$
where I is the identity matrix.
From (7) there results
$$\{\dot{Z}_i\} = \begin{bmatrix} I & 0 \\ 0 & M_i^{-1} \end{bmatrix} \left( \begin{Bmatrix} 0 \\ f_i(t) \end{Bmatrix} - \begin{bmatrix} 0 & -I & 0 & 0 \\ K_i & C_i & K_n & C_n \end{bmatrix} \begin{Bmatrix} Z_i \\ Z_n \end{Bmatrix} \right) \tag{8}$$
which is the first order equation that will be solved by the node processor
assigned to region i in the simulation system. The computational
requirements for that node processor are determined below from this equation.
The largest linear problem that can be contained in the processor region i
is one in which the matrices $[M_i]$, $[C_i]$, $[K_i]$, $[K_n]$, and $[C_n]$ are full.
Therefore, treatment of matrices having rank equivalent to the number of
generalized coordinates for region i can be required. Also, the matrix
$[M_i]$ must be inverted at every evaluation of $\{\dot{Z}_i\}$ in Equation (8). As a
result, it may be necessary to invert at each time step a matrix ($[M_i]$)
having rank equivalent to the number of generalized coordinates for region i.
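A minimal sketch of one evaluation of the state derivative for a single region follows, assuming the state ordering $\{Z_i\} = \{q_i; \dot{q}_i\}$ and $\{Z_n\} = \{q_n; \dot{q}_n\}$. The function names and the small example matrices are hypothetical. Note that the sketch solves $[M_i]x = b$ by Gaussian elimination rather than forming the explicit inverse discussed in the text; this is a substitution made here for brevity, not the report's method.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # work on a copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def matvec(a, x):
    """Matrix (list of rows) times vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def state_derivative(M, C, K, Cn, Kn, f, z, zn):
    """Return dZ_i/dt for Z_i = [q_i, qdot_i] and Z_n = [q_n, qdot_n]."""
    n, nn = len(M), len(zn) // 2
    q, v = z[:n], z[n:]
    qn, vn = zn[:nn], zn[nn:]
    # Lower block row of the bracketed term: f_i - K_i q_i - C_i v_i - K_n q_n - C_n v_n
    rhs = [fi - a - b - c - d for fi, a, b, c, d in
           zip(f, matvec(K, q), matvec(C, v), matvec(Kn, qn), matvec(Cn, vn))]
    # Upper block row gives d(q_i)/dt = v_i; lower gives the accelerations.
    return v + solve(M, rhs)

# Two local coordinates coupled to one neighbor coordinate (made-up values).
zdot = state_derivative(
    M=[[2.0, 0.0], [0.0, 1.0]], C=[[0.0, 0.0], [0.0, 0.0]],
    K=[[3.0, -1.0], [-1.0, 1.0]], Cn=[[0.0], [0.0]], Kn=[[-2.0], [0.0]],
    f=[0.0, 0.0], z=[1.0, 0.0, 0.0, 0.0], zn=[0.5, 0.0])
print(zdot)
```

A function of this shape is what a time step integrator would call repeatedly, with the neighbor states `zn` supplied over the nearest neighbor communication paths.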
Number of Generalized Coordinates
The limit of 200 for the number of generalized coordinates for region i
was chosen arbitrarily. This number determines the local data memory size
and the maximum time required to solve Equation (8).
Given that $\{q_i\}$ can have 200 elements, the mass matrix for region i can have
rank 200. From reference [1], approximate numbers for
multiplications/divisions and for additions/subtractions to invert a 200 x 200
matrix are $200^3$ and $200(199)^2$, respectively. Carrying out the indicated
multiplications gives approximately $8 \times 10^6$ multiplications/divisions
and $8 \times 10^6$ additions/subtractions.
It should be noted that other computations are necessary in this linear
problem besides the matrix inversion. Numerous matrix multiplications
are also to be made, primarily those of a vector by a matrix. The numbers
of multiplications/divisions and of additions/subtractions required in
multiplying an m x p matrix by an n x m matrix are each n·m·p. The
multiplication of a 200 x 1 matrix (a column vector) by a 200 x 200 matrix
therefore requires about 200^2 = 4 x 10^4 multiplications/divisions and
4 x 10^4 additions/subtractions. This is negligible in comparison to the
number of operations associated with the inversion of the mass matrix.
Consequently, the inversion requirements are used to establish the overall
multiplication/division and addition/subtraction requirements for the
maximum linear problem to be treated by the processor for region i.
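The operation counts quoted above can be checked with a few lines of
arithmetic. The following Python sketch simply evaluates the formulas stated
in the text (n^3 and n(n-1)^2 for inversion, n·m·p for a matrix product). The
function names are illustrative and do not come from the report.

```python
def inversion_op_counts(n):
    """Approximate operation counts for inverting an n x n matrix,
    using the formulas quoted in the text from reference [1]."""
    mult_div = n ** 3            # multiplications/divisions
    add_sub = n * (n - 1) ** 2   # additions/subtractions
    return mult_div, add_sub


def matrix_product_op_count(n, m, p):
    """Operation count for the product of an n x m and an m x p matrix."""
    return n * m * p


if __name__ == "__main__":
    mult_div, add_sub = inversion_op_counts(200)
    print(mult_div, add_sub)                      # both close to 8 x 10^6
    print(matrix_product_op_count(200, 200, 1))   # matrix times column vector
```

The matrix-vector count is smaller than the inversion count by a factor of
200, which is why the inversion alone is used to size the processor.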
Externally Communicated Coordinates
The maximum number of coordinates for external communication from region i
is 25% of the generalized coordinates used for region i.
The ratio of external to local coordinates can vary considerably with the
specific problem being considered. A problem consisting of two complex
regions connected by a few simple springs will be associated with a low
ratio for each region. A problem consisting of regions intimately connected
at many points will be associated with a large ratio for each region. For
readily envisioned problems, the 25% ratio appears reasonable and one which
need rarely be exceeded.
With this assumption, a maximum of 25% of the state variables for region i
need be communicated to neighboring regions. Since there are 400 state
variables maximum, 100 variables can be communicated to (and from)
neighboring processors. As a result, the maximum size of each matrix [C_n]
and [K_n] is 200 x 50.
With the size of [C_n] and [K_n] established, the sizes of all the matrices
in Equation (8) for region i are known. These sizes are:
    [M_i]       200 x 200
    [K_i]       200 x 200
    [C_i]       200 x 200
    [C_n]       200 x 50
    [K_n]       200 x 50
    {f_i(t)}    200 x 1
    {Z_i}       400 x 1
    {Z_n}       100 x 1
    {Ż_i}       400 x 1
It should be noted that this memory space for the largest linear problem
does not include memory for intermediate results involved in matrix
manipulations (e.g., inversion and multiplication), nor does it include
program memory.
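As a rough check, the storage implied by these sizes can be totaled directly.
The Python sketch below sums the element counts of the arrays listed above;
the dictionary keys are illustrative labels, and the last entry assumes that
the final 400 x 1 array in the list is the state derivative vector.

```python
# Dimensions (rows, columns) of the arrays in Equation (8) for region i.
ARRAY_SIZES = {
    "M_i": (200, 200),
    "K_i": (200, 200),
    "C_i": (200, 200),
    "C_n": (200, 50),
    "K_n": (200, 50),
    "f_i": (200, 1),
    "Z_i": (400, 1),
    "Z_n": (100, 1),
    "Zdot_i": (400, 1),   # assumed to be the state derivative vector
}


def total_floating_point_words(sizes):
    """Total number of floating point numbers needed to hold all arrays."""
    return sum(rows * cols for rows, cols in sizes.values())


if __name__ == "__main__":
    print(total_floating_point_words(ARRAY_SIZES))
```

The total is somewhat below the approximate figure of 150,000 floating point
numbers given in the summary of node processor requirements.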
Nearest Neighbor Communications 
A maximum of 6 neighboring processors will exchange state variable information 
with the processor for region i. 
The number of neighboring processors was arbitrarily chosen, and corresponds
to the physical elements which can apply forces or moments to the physical
element represented by the processor for region i. Six is sufficiently
large to allow a complex structure, but is also the smallest number which can
be used to treat a three-dimensional structure in a symmetric manner. It
is noted that as many as 100 variables can be transmitted along any one of
these paths (if no variables are transmitted along the other 5 paths).
If all 6 paths are transmitting information and if the number of variables
transmitted on each path is the same, then each path will carry 100/6
(about 17) variables.
Summary of Node Processor Requirements
Maximum computations per derivative equation (Equation (8)) per processor
    Multiplications/Divisions                            8 x 10^6
    Additions/Subtractions                               8 x 10^6
Number of generalized coordinates per processor          200
Number of state variables per processor                  400
Approximate numerical memory size (in floating point
    numbers) per processor (not including that for
    matrix manipulation)                                 150,000
Number of state variables transmitted to and from
    each processor                                       100
Number of separate transmission paths from each
    processor to neighbor                                6
Primary desirable functions                              matrix operations
Secondary desirable functions                            sin x, tan^-1 x,
                                                         a^x, log x
SOFTWARE REQUIREMENTS 
The Digital System for Structural Dynamics Simulation, as can be seen from
the previous discussion, requires a great variety of software.
This software can be broken into five segments.
1. Offline programs for the central minicomputer to process user
specifications of models, to prepare Node Processor programs for
execution, and to aid in the development of Node Processor operational
programs.
2. Realtime programs for the central minicomputer to load Node
Processor programs and data, to coordinate execution of the
model and to provide operator controls and displays.
3. Operational programs for the Node Processors to cause them to
simulate substructures of the model and to communicate with
neighbors and the central minicomputer.
4. Microcode (sometimes called firmware) to define the instruction
set of the Node Processors.
5. Diagnostic Software.
HARDWARE REQUIREMENTS 
Hardware requirements for the simulation system are listed below.
1. An array of Node Processors to perform the simulation.
2. Node Processor properties include:
a) A large local memory for program and data storage.
b) Fast instruction times.
c) Auxiliary hardware for fast floating point multiplication,
addition, subtraction, and division.
d) Microcoded processor for a custom instruction set.
e) Parallel communication paths to the six nearest
neighbor Node Processors.
3. A central control minicomputer to direct the operation of the Node
Processors and develop all associated software.
OVERVIEW OF SYSTEM ARCHITECTURE - SOFTWARE 
The Digital System for Structural Dynamics Simulation consists of a three
dimensional array of processors controlled by a minicomputer. Each processor
in the array is capable of communicating directly with its six adjacent
processors and with the minicomputer controller. The term Node Processor
is used to identify a processor in the array.
The individual Node Processors are logically organized as a three dimensional
array with each node having an address specified by three subscripts
[e.g.: P(5,2,3)]. Each Node Processor will have a bidirectional
communications link with the adjacent processor in both directions for each
dimension. The first and last Node Processor in each dimension will be linked
to make each dimension a full circle of processors (a hypertorus). The
system has been designed with five processors in each dimension for a total
of 125 processors. Processors will be numbered from 1 to 5 in each dimension.
For example, P(2,5,3) will communicate directly with processors P(1,5,3),
P(3,5,3), P(2,4,3), P(2,1,3), P(2,5,2), P(2,5,4); see Figures 2 and 3.
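The wraparound neighbor rule described above can be illustrated with a short
Python sketch. It assumes processors are addressed by 1-based subscript
triples, as in the text; the function name is hypothetical and is not part of
the system software.

```python
def torus_neighbors(p, size=5):
    """Return the six nearest neighbors of processor p = (i, j, k) in a
    size x size x size array whose dimensions each wrap around.
    Subscripts run from 1 to size."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            q = list(p)
            # Shift to 0-based, step with wraparound, shift back to 1-based.
            q[axis] = (p[axis] - 1 + step) % size + 1
            neighbors.append(tuple(q))
    return neighbors


if __name__ == "__main__":
    for neighbor in torus_neighbors((2, 5, 3)):
        print(neighbor)
```

For P(2,5,3) this produces the six processors listed in the example,
including the wraparound neighbor P(2,1,3).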
All of the processors will be connected to the minicomputer based controller
with a bidirectional data and control bus. Data being sent from the
controller to the processors will be prefaced with address information. Only
the Node Processor(s) which need the data will store it in local memory.
When several Node Processors must transmit data over the common bus,
the controller will command one Node Processor at a time to put data on
the bus for use by the controller and/or other processors (Figures 4 and 5).
CONTROLLER OPERATING SYSTEM SOFTWARE 
The software for using the Digital System to perform structural dynamics
simulations will consist of the following five segments.
1. Offline programs for the central minicomputer to process
user specifications of models, prepare Node Processor programs
for execution and aid in the development of Node Processor
operational programs.
2. Realtime programs for the central minicomputer to load Node
Processor programs and data, coordinate execution of the
model and provide operator controls and displays.
3. Operational programs for the Node Processors to cause them
to simulate substructures of the model and communicate
with neighbors and the central minicomputer.
4. Microcode (sometimes called firmware) to define the instruction
set of the Node Processors.
5. Diagnostic Software.
Figure 2 NASA/Lewis Structural Dynamics Simulator Block Diagram
(5 x 5 x 5 Node Processor array with interprocessor communications paths;
Ref. 3.2)
Figure 3 SDS Six Nearest Neighbor Block Diagram
(16-bit bidirectional row, column and rank loops, 25 loops each, with a
maximum of 5 Node Processors per loop; Node Processor sub detail)
Figure 4 Structural Dynamics Simulator Block Diagram
(32-bit control minicomputer connected by global data and control busses
to Node Processors 1 through N, N <= 125; Ref. 3.1)
Figure 5 SDS Global Bus Interface Block Diagram
(central minicomputer, global bus control, and Node Processor global bus
interface sub detail. Note 1: arrows indicate an example of flow when one
node talks to another node; all other nodes listen.)
OFFLINE MODEL DEVELOPMENT SOFTWARE 
The offline software will consist of a linker and assembler for generating
the specific code to be executed by a Node Processor and a compiler for a
Model Specification Language to assist users in programming the application
software.
Node Processor Assembler 
The Node Processor Assembler program will translate the symbolic assembly
language programs written for the Node Processor into object modules that
may be linked to other modules to form an executable memory image for
the Node Processor. This assembler should have moderate macro capabilities
and a full complement of assembler directives so as to ease the task of
programming at the machine level. Also, the assembler should interface
easily with the higher level Model Specification Language compiler.
Node Processor Linker
The Node Processor Linker will take one or more object modules created by
the Node Processor Assembler and produce the memory image to be loaded into
the Node Processor for execution. It is the linker's task to resolve all
inter-module address references and relocate all intra-module addresses
based on the location of the module within the executable memory image.
The linker should be capable of linking the user defined data tables created
during model specification with the appropriate Node Processor operational
programs. It is also the linker's task to build a symbol table for use by
the Execution Control Program. The symbol table will be used when loading
the Node Processors with data tables and functions when the operational
program is already resident in the Node Processor memory. Also, the symbol
table will be used during debugging and crash dump analysis.
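The two linker tasks named above, resolving inter-module references and
building a symbol table, can be sketched in Python as follows. The object
module layout used here (a list of words, a table of defined symbols, and a
table of references) is purely hypothetical and much simpler than any real
object format; the report does not define one.

```python
def link(modules):
    """Combine object modules into one memory image.

    Each module is a dict with hypothetical keys:
      'name'  - module name
      'words' - list of memory words (integers)
      'defs'  - symbol name -> offset within the module
      'refs'  - offset within the module -> symbol whose address goes there
    Returns the memory image and the global symbol table."""
    image, symbols, base = [], {}, {}
    # Pass 1: place modules one after another and record symbol addresses.
    for mod in modules:
        base[mod["name"]] = len(image)
        for sym, offset in mod["defs"].items():
            symbols[sym] = base[mod["name"]] + offset
        image.extend(mod["words"])
    # Pass 2: patch each reference with the final address of its symbol.
    for mod in modules:
        for offset, sym in mod["refs"].items():
            image[base[mod["name"]] + offset] = symbols[sym]
    return image, symbols


if __name__ == "__main__":
    program = {"name": "prog", "words": [10, 0, 11],
               "defs": {"start": 0}, "refs": {1: "table"}}
    data = {"name": "data", "words": [7, 8],
            "defs": {"table": 0}, "refs": {}}
    image, symbols = link([program, data])
    print(image)
    print(symbols)
```

The returned symbol table corresponds to the table the Execution Control
Program would consult when reloading only the data portion of a program.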
Model Specification Language Compiler
A high order language will be developed which will consist of a set of
rules governing the definition of models and forcing functions. The compiler
program (sometimes called translator) will translate the model specification
into the data tables and functions needed by the Node Processor operational
programs.
The user will use the minicomputer text editor to create and modify text
files containing the model specification. For linear problems, the bulk
of the model specification will consist of numerical values for various
matrices in the state equations. For non-linear problems, the user may
need to specify a function for each element. These functions will be
specified in Fortran-like arithmetic expressions. Other statements in
the language will allow the user to specify the coupling between substructures,
the forcing functions and the desired outputs.
The compiler program will read the text file containing the user's speci-
fication of the model and perform the following operations. 
1. Assign substructures to processors using an algorithm to 
balance the processing load of each processor and optimize 
interprocessor communications. The compiler will be told 
how many Node Processors are operational in each dimension 
of the processor array. If there are more substructures 
than processors, substructures with the greater interaction 
will be assigned to the same processor. 
To ensure proper balancing of the processing load the user
will be able to specify the relative execution times for
various substructures. The assignment subprogram will use
the relative execution times as a weighting factor when
assigning more than one substructure to a single processor.
An iterative algorithm would be the simplest approach to
making the best assignments.
Once the substructures have been combined, the compiler will
optimize the interprocessor communications. Interacting
substructures will be assigned to adjacent processors so
that as much communication as possible can occur in parallel.
2. Generate the data tables each Node Processor operational
program will use to solve the state equations for the
substructures assigned to it.
3. Generate the tables to specify to each processor the inputs
from each of its six adjacent processors, the inputs from
the bidirectional data bus, the outputs to each of its six
adjacent processors, and the outputs to the bidirectional
data bus.
4. Generate a data file for the realtime central minicomputer
program to specify the forcing function and the system outputs.
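The load balancing part of step 1 can be illustrated with a simple greedy
heuristic: take substructures in order of decreasing relative execution time
and give each to the currently least loaded processor. This is only one
possible sketch in Python. It is not the assignment algorithm of the report
(which suggests an iterative approach), it ignores the interaction and
communication criteria, and the substructure names are invented for the
example.

```python
def assign_substructures(times, n_processors):
    """Greedy assignment of substructures to processors.

    times: dict mapping substructure name -> relative execution time.
    Returns (assignment, loads), where assignment maps each substructure
    to a processor index and loads holds the total time per processor."""
    loads = [0.0] * n_processors
    assignment = {}
    # Longest substructures first, each to the least loaded processor.
    for name in sorted(times, key=times.get, reverse=True):
        target = min(range(n_processors), key=loads.__getitem__)
        assignment[name] = target
        loads[target] += times[name]
    return assignment, loads


if __name__ == "__main__":
    relative_times = {"rotor": 5, "casing": 3, "bearing_a": 3,
                      "bearing_b": 2, "mount": 2, "seal": 1}
    assignment, loads = assign_substructures(relative_times, 2)
    print(assignment)
    print(loads)
```

A fuller version would add a second criterion that favors placing strongly
interacting substructures on the same or adjacent processors.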
The compiler produces the required data modules which are loaded with the
appropriate operational program into the Node Processor for execution.
This makes the structure of the operational programs dependent on the output
of the Model Specification Language compiler, but makes the compiler
somewhat machine independent. Only the functions generated will be actual
machine code. These functions will be specified in Fortran-like statements,
thus making the language syntax independent of the Node Processor.
The compiler's code generating functions will be tailored to the Node
Processor.
The instruction set of the Node Processor was designed to assist the task
of automatically generating code from Fortran arithmetic expressions. The
data format for floating point numbers was chosen to conform to the popular
PDP-11 data format. To simplify the generation of the data tables as
described above, many addressing modes are available to the Node Processor
for building address lists as needed.
REALTIME MODEL EXECUTION SOFTWARE 
The realtime software will consist of a debugging facility for developing
operational programs and an execution control program for doing actual
simulations.
Execution Control Program
The Execution Control Program will be responsible for loading the Node 
Processors with operational programs, controlling the network of processors 
during execution, collecting data for animated displays as the execution 
proceeds and gathering the final results when the simulation completes. 
Loading Node Processors 
There will be two types of program loading for the Node Processors. The
first type consists of loading the entire Node Processor with its operational
program and data. The second type consists of loading only the data portions
of the operational program. Typically, when the array of processors is
brought up, the processors will be loaded with their operational programs and
data for the first simulation run. Subsequent simulations will load only
the data tables required by the operational programs, provided the model
is compatible with the resident programs.
During the initial program load the Node Processor will be forced to execute
its microcode from location 0. The microcode at this location may execute
a few diagnostics to determine the integrity of its local memory and exercise
the communications link with the controller. The processor will then wait
for a command from the central minicomputer controller. The central
minicomputer will poll the Node Processors to determine which are on line and
ready.
When each Node Processor has acknowledged its readiness status, the central
minicomputer controller will send a command to a processor telling it to
accept the ensuing data stream as a memory image. Following the memory
image will be the initial program counter value. Typically, this value
will address a program routine which will wait for the "Start Execution"
command from the controller.
After the execution of a simulation is completed, the operational program on
the Node Processor will enter a routine to wait for additional commands from
the controller. Subsequent simulations which use the same operational
program will then need only to load the data portion of the programs. The
operational programs should be table driven as much as possible to
accommodate this scheme.
Controlling the Processor Network during Execution
The Execution Control Program will broadcast the start simulation command
to all nodes. The operational programs on the Node Processors will then
execute the time step until they need to communicate with other processors.
A control bus line will be raised by each Node Processor when it is ready.
The bus line will only appear active to the controller once all processors
are ready. The controller will then send the commands to initiate transfers
between neighbors. When transfers are completed in the requested
direction, the controller is notified by each processor. After all
are ready again, the controller will broadcast to the nodes to start
transferring data in the opposite direction.
After the nearest neighbor communications have completed, the controller
will command each node in turn to put its global state variables on the
data bus. All processors will listen to the global bus as state variables
appear, and each will store only those state variables from other nodes that
its internal tables direct it to store. The controller will also store those
state variables it needs for the graphics display.
Once all the state variables have been transferred, the controller may
then interrogate individual Node Processors for additional data needed
to update the graphics display. When Node Processors have completed all
the necessary communications, they will continue on with their simulation
computations.
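The sequence of one time step, as described above, can be sketched in Python
as a small sequential model. The class and function names are hypothetical,
the shared "ready" line is modeled as a logical AND over all nodes, and no
real data transfer or timing is represented.

```python
class Node:
    """Minimal stand-in for a Node Processor during one time step."""

    def __init__(self, name):
        self.name = name
        self.ready = False
        self.steps_done = 0

    def compute_time_step(self):
        self.steps_done += 1
        self.ready = True      # raise the ready line on the control bus


def all_ready(nodes):
    """The bus line appears active only when every node is ready."""
    return all(node.ready for node in nodes)


def run_time_step(nodes):
    """Return a log of the controller actions for one time step."""
    log = []
    for node in nodes:
        node.compute_time_step()
    if not all_ready(nodes):
        raise RuntimeError("a node failed to signal ready")
    # Neighbor transfers take place in one direction, then the other.
    log.append("neighbor transfers, first direction")
    log.append("neighbor transfers, opposite direction")
    # Each node in turn puts its global state variables on the bus.
    for node in nodes:
        log.append(node.name + " puts global state variables on the bus")
    for node in nodes:
        node.ready = False     # lower the line before the next step
    return log


if __name__ == "__main__":
    array = [Node("P(1,1,1)"), Node("P(2,1,1)"), Node("P(3,1,1)")]
    for entry in run_time_step(array):
        print(entry)
```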
During the communications procedure all on-line Node Processors are required
to signal acceptance of data. If the controller does not receive the
acceptance signal, the controller will time out. The Execution Control Program
will then interrogate the Node Processors individually to determine which
one(s) are not responding. Special status lines on the control bus may be used
to selectively put a Node Processor offline or cause the processor to go into
a microcoded diagnostic routine to determine the reason for the failure.
Collecting Graphics Display Data 
The Execution Control Program will interrogate selected Node Processors
for data needed to drive a graphics display. The collection of the data
will be directed by the data tables generated during the Model Specification
compilation. The data transfers from Node Processors to the central
minicomputer will take place between time steps of the simulation.
Gathering Final Results 
The Execution Control Program will collect the final state variables from
each Node Processor after the simulation has completed. The data collected
will be stored in disk files for analysis by other programs. The
data tables generated during compilation of the Model Specification will
direct the Execution Control Program to command the appropriate variables to
be sent from a node to the host.
Node Processor Program Debugger 
A Node Processor Program Debugger facility is needed to assist in the develop-
ment of the operational programs. This debugger will perform the functions 
required by the realtime Model Execution Program but will support more 
operator interaction. 
A special library of debugger modules will be available to the program
developer. These modules will be linked to operational programs under
development to support features such as instruction execution tracing,
breakpoints and memory examination/modification.
When programs are linked to the debugger modules and loaded into the
processors with the Node Processor Debugger Facility, the operator may
interact with the execution of the program. This facility is expected to
increase the system programmer's productivity considerably.
NODE PROCESSOR OPERATIONAL PROGRAM 
The Node Processor Operational Programs are those programs which will cause
each processor to simulate a substructure and communicate with its neighbors
and the central minicomputer. The programs will be written in assembly
language or possibly a Fortran-like language that is easily translated
into assembly language code.
The content of these programs will be highly dependent on the actual model
being used for the simulation. The programs should be structured so as to
operate on the data tables that are generated by the Model Specification
compiler. The data tables and special functions generated by the compiler
must be easily linked with the Operational Program to perform the required
simulation.
NODE PROCESSOR MICROCODE 
The Node Processors will be microcoded to provide a typical general purpose
minicomputer instruction set. In addition to the general purpose instruction
set, special high-level microcoded routines will be available for doing
matrix and vector operations, communicating with nearest neighbors and the
central minicomputer and providing support for diagnostics. The microcoded
instruction set is described in the Node Processor Instruction Set Reference.
DIAGNOSTIC SOFTWARE 
Diagnostic software will be implemented at the following three levels. 
1. Diagnostic Control Program in the central minicomputer.
2. Diagnostic Program in each processor.
3. Diagnostic Machine instructions implemented in microcode.
The Diagnostic Control Program in the central minicomputer will load the
Diagnostic Program into each processor, accept operator direction, issue
control commands to the processors, input results from the processors, and
display the results to the operator.
The Diagnostic Program in each processor will run various tests as commanded
by the central minicomputer. The tests will progress from simple ones which
test a minimum of circuitry to more complex ones. Tests will be included
for the interprocessor communications, bidirectional bus communications,
and the internal circuitry on each microprocessor board.
The Diagnostic Machine instructions will be designed to thoroughly exercise
the processor board circuits and aid in fault isolation.
NODE PROCESSOR ARCHITECTURE 
The Node Processor is part of an array of processors, each capable of
communicating with a central host computer and directly with six neighbors.
It consists of a bit-slice 32-bit CPU, up to 256K of 64-bit memory, 256
72-bit floating point scratch pad registers and floating point units capable
of overlapping multiplication with either addition, subtraction or division.
The instruction set of the Node Processor includes a full complement of
instructions typically found on a general purpose minicomputer plus additional
high level instructions to handle vectors and matrices.
Each instruction for the Node Processor is 64 bits and contains an opcode
and up to 2 operand fields. There are 13 possible addressing modes per
operand with a minimum of restrictions on the addressing modes available
for each instruction. With over 130 opcodes and up to 13 addressing modes
per operand there are over 10,000 distinct instructions defined. This
addressing scheme gives the Node Processor a very versatile and powerful
instruction set.
MEMORY 
Main memory appears as 64 bits to the machine level programmer. The internal
data bus connecting memory with the bit-slice CPU and floating point scratch
pad memory is 32 bits wide; therefore 2 transfers are made to access a 64-bit
word and only 1 transfer for a 32-bit half-word. The number of memory
transfers required by a processor instruction is dependent on the type of data
being manipulated. The actual transfers made are under control of the
microcode and are not the responsibility of the machine language programmer.
The machine language programmer is capable of addressing memory only at 64-bit
word boundaries, thus there is never the possibility of addressing memory at
a half word boundary. If memory were addressed at 32-bit boundaries under
program control, alignment problems could arise when addressing 64-bit
floating point quantities. This problem is eliminated
since all programs are restricted to addressing at 64-bit boundaries.
The memory is 1-bit error correcting, 2-bit error detecting. If the Node
Processor detects a 2-bit error during execution of an instruction, a trap
via location 4 is performed. One-bit errors are corrected by the hardware
and the currently executing instruction is completed normally.
There is a memory address comparison register (MACR) which is accessed
whenever memory is written. When memory is written the address being used
is compared to the value in MACR and the appropriate bits in the processor
status word (PSW) are set.
Main memory is used to hold data and program instructions. With the exception
of the lowest 70 words of memory, instructions and data may be anywhere in
memory. Memory locations 0 through 69 are used for processor traps and for
holding constants required by certain high level instructions. Main memory
size is either 256K, 512K, 768K or 1 million 32-bit words. Each memory
board in the system may hold 256K 32-bit words (128K 64-bit words).
DATA TYPES 
There are 4 data types handled by the Node Processor: integer, floating
point, vectors and matrices. Integer data is 32 bits wide and left justified
in a 64-bit memory word. The LSB of the integer is at bit 32 while
the MSB is at bit 63. The integer data type is used to represent numerical
quantities and hold memory addresses. When used as an address, only the
low order 19 bits of the integer are used. However, all 32 bits are used
for calculating memory addresses and no checks are made to ensure that the
unused portion of the integer is all zeroes.
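The integer layout described above can be illustrated with a few bit
operations. This Python sketch places a 32-bit integer in bits 63-32 of a
64-bit word and extracts the 19-bit address portion. The use of two's
complement for negative values is an assumption, since the text does not
state the signed representation, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF
ADDRESS_MASK = (1 << 19) - 1   # low order 19 bits used as an address


def pack_integer(value):
    """Left justify a 32-bit integer in a 64-bit memory word
    (LSB at bit 32, MSB at bit 63)."""
    return (value & MASK32) << 32


def unpack_integer(word):
    """Recover the signed integer from a 64-bit memory word,
    assuming two's complement representation."""
    raw = (word >> 32) & MASK32
    return raw - (1 << 32) if raw & 0x80000000 else raw


def address_portion(value):
    """Bits of a 32-bit integer that select a memory location."""
    return value & ADDRESS_MASK


if __name__ == "__main__":
    word = pack_integer(-5)
    print(hex(word), unpack_integer(word))
    print(hex(address_portion(0x12345678)))
```

Nineteen address bits select one of 2^19 (512K) locations.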
Floating point data is 64 bits in main memory and 72 bits in the floating
point scratch pad memory. The increased scratch pad representation allows
for 7 extra bits of precision in the mantissa. When transferring from main
memory to the scratch pad registers, the 64-bit number must be expanded to
72 bits. The sign and exponent will be the same in both formats. Bit 62 is set
to 1 in the scratch pad register if the exponent is non-zero. The main
memory mantissa is mapped to bits 61-7 of the scratch pad register and
bits 6-0 are set to 0 to complete the mantissa. When going from the scratch
pad registers to main memory, the sign and exponent are copied directly and
bits 61-7 are taken as the mantissa. (Note: The MSB of the mantissa,
1 for a non-zero number, is not stored in main memory. The mantissa is
truncated to 55 bits when going to main memory.)
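The expansion and truncation rules above can be sketched with integer bit
operations in Python. The scratch pad bit positions (62, 61-7, 6-0) are taken
from the text. The main memory layout used here, a sign bit at bit 63, an
8-bit exponent in bits 62-55 and a 55-bit mantissa in bits 54-0, is an
assumption inferred from the 55-bit mantissa and the PDP-11 style format; the
function names are illustrative.

```python
MANTISSA_BITS = 55
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1
SIGN_EXP_MASK = 0x1FF          # 1 sign bit + 8 exponent bits (assumed)


def expand_to_scratch_pad(word64):
    """Expand a 64-bit main memory value to the 72-bit register format."""
    sign_exp = (word64 >> MANTISSA_BITS) & SIGN_EXP_MASK
    mantissa = word64 & MANTISSA_MASK
    exponent = sign_exp & 0xFF
    hidden = 1 if exponent != 0 else 0     # bit 62: leading mantissa bit
    # Sign/exponent in bits 71-63, mantissa in bits 61-7, bits 6-0 zero.
    return (sign_exp << 63) | (hidden << 62) | (mantissa << 7)


def truncate_to_main_memory(reg72):
    """Convert a 72-bit register value back to the 64-bit memory format,
    dropping bit 62 and the 7 low order mantissa bits."""
    sign_exp = (reg72 >> 63) & SIGN_EXP_MASK
    mantissa = (reg72 >> 7) & MANTISSA_MASK
    return (sign_exp << MANTISSA_BITS) | mantissa


if __name__ == "__main__":
    word = (0x81 << MANTISSA_BITS) | 0x123
    reg = expand_to_scratch_pad(word)
    print(hex(reg))
    print(truncate_to_main_memory(reg) == word)
```

Because the extra 7 bits exist only in the registers, a value moved to main
memory and back loses any precision accumulated in bits 6-0.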
Vectors are arrays of floating point values which reside in consecutive
locations of main memory. The address of a vector is the location of the
first floating point value. The length of a vector is the number of floating
point values making up the vector.
Matrices are two dimensional arrays of floating point values and reside in
a contiguous block of main memory. Each row of the matrix is stored in
consecutive memory locations with the first row starting at the beginning
address of the matrix. If there are N elements in a row then the second row
will start at the matrix address + N. Likewise, the third starts at matrix
address + 2N and so on until the number of rows in the matrix is exhausted.
The elements of each column in a matrix are separated by N-1 memory locations.
(Note: The autoincrement addressing modes of the instruction set enable the
programmer to easily access each member of a column without the need for
doing complex address calculations.)
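The row-by-row storage rule can be written as a one-line address formula. In
this Python sketch the row and column indices are zero-based, which is a
convention chosen for the example rather than one stated in the text, and the
function names are illustrative.

```python
def element_address(base, n_cols, row, col):
    """Address of element (row, col) of a matrix stored row by row,
    with n_cols elements per row and zero-based indices."""
    return base + row * n_cols + col


def column_addresses(base, n_rows, n_cols, col):
    """Addresses of the successive elements of one column."""
    return [element_address(base, n_cols, row, col) for row in range(n_rows)]


if __name__ == "__main__":
    # A 3 x 4 matrix stored starting at address 1000.
    print(element_address(1000, 4, 1, 0))    # start of the second row
    print(column_addresses(1000, 3, 4, 1))   # second column
```

Successive elements of a column are N addresses apart, leaving N-1 other
locations between them, which is the stride an autoincrement mode would use.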
NODE PROCESSOR REGISTERS 
The machine level programmer has access to 8 general purpose 32-bit integer
registers, 3 special purpose 32-bit integer registers and 250 72-bit floating
point scratch pad registers. The general purpose integer registers are
designated R0 through R7 and are used to hold integer numerical quantities
and addresses. These registers are not affected by a Node Processor
instruction unless specified as an operand by the programmer.
The special purpose integer registers are the system stack pointer (SP),
program counter (PC) and a vector/matrix size register (MS). The stack
pointer contains the address of an area of memory set aside by programs
for dynamic storage. The stack grows (increasing memory addresses) as
words are placed (pushed) onto it and shrinks (decreasing memory addresses)
as words are removed (popped). The program counter is used to specify the
location from which the next instruction is to be taken. The vector/matrix
size register stores an integer value which conveys the length of a vector
or the number of rows of a matrix to certain high level instructions.
The floating point scratch pad registers are designated F0 through F249.
They are general purpose floating point accumulators and are under programmer
control except for the actions of certain high level instructions which may
destroy their contents during execution.
PROCESSOR STATUS WORD 
The Processor Status Word (PSW) contains information on the current status
of the Node Processor. Selected portions of the PSW are affected by the
execution of instructions. Each instruction described below will detail
which bits of the PSW are affected during execution. In general, bits 0-3
reflect the results of the last 32-bit integer operation on the bit-slice
CPU, bits 4-7 reflect the results from the last floating point multiply
operation, bits 8-11 reflect the last floating point add/subtract/divide
result, bits 15-16 reflect the error status of the last memory read, bits
13-14 are under the programmer's control for enabling traps, bit 17 is
concerned with host/node synchronization and bits 18-20 reflect comparisons
of memory addresses when doing writes.
High level instructions affect the PSW in a manner different from simple
instructions. Since all high level instructions involve multiple operations
on either or both of the floating point units, the final PSW state reflects
multiple operations. The only status bits which are meaningful are the
overflow (FOV1, FOV2) and divide by zero (FDZ) bits. If any of these bits
are set during execution of a high level instruction then they will be set
at the end of the instruction. The status of all other PSW bits for the
integer and floating point units is undetermined.
Bit   Mnemonic   Contents
0     CR         Integer carry
1     OV         Integer overflow
2     NE         Integer result negative
3     ZE         Integer result zero
4     FUN1       FP multiply underflow
5     FOV1       FP multiply overflow
6     FNE1       FP multiply negative
7     FZE1       FP multiply result zero
8     FUN2       FP add/subtract/divide underflow
9     FOV2       FP add/subtract/divide overflow
10    FNE2       FP add/subtract/divide result negative
11    FZE2       FP add/subtract/divide result zero
12    FDZ        FP divide by zero
13    FTR        Trap enable on FDZ, FOV1, or FOV2 via location 10
14    TRC        Trace enable. Trap after each instruction.
15    PAR        One bit error from memory access
16    RER        Two bit error from memory access
17    RDY        Write only. Signals to host this processor
                 is done with current step.
18    BPT        Enable trap on setting of FEN
19    FEN        Memory write address equals MACR
20    FLT        Memory write address less than MACR
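The bit assignments in the table lend themselves to a small decoding routine.
The Python sketch below lists the mnemonics of the bits set in a PSW value and
evaluates the floating point trap condition, FTR and (FOV1 or FOV2 or FDZ).
The function names are illustrative and are not part of the Node Processor
software.

```python
PSW_BITS = {
    0: "CR", 1: "OV", 2: "NE", 3: "ZE",
    4: "FUN1", 5: "FOV1", 6: "FNE1", 7: "FZE1",
    8: "FUN2", 9: "FOV2", 10: "FNE2", 11: "FZE2",
    12: "FDZ", 13: "FTR", 14: "TRC", 15: "PAR", 16: "RER",
    17: "RDY", 18: "BPT", 19: "FEN", 20: "FLT",
}
BIT_OF = {name: bit for bit, name in PSW_BITS.items()}


def decode_psw(psw):
    """Return the mnemonics of all bits set in psw, lowest bit first."""
    return [name for bit, name in sorted(PSW_BITS.items())
            if (psw >> bit) & 1]


def is_set(psw, name):
    """True if the named PSW bit is set."""
    return bool((psw >> BIT_OF[name]) & 1)


def floating_point_trap_pending(psw):
    """Trap condition: FTR and (FOV1 or FOV2 or FDZ)."""
    return is_set(psw, "FTR") and (
        is_set(psw, "FOV1") or is_set(psw, "FOV2") or is_set(psw, "FDZ"))


if __name__ == "__main__":
    psw = (1 << 3) | (1 << 12) | (1 << 13)
    print(decode_psw(psw))
    print(floating_point_trap_pending(psw))
```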
INPUT/OUTPUT 
The Node Processor communicates with the central host computer and each of
its six nearest neighbors. I/O is synchronous in that there are no interrupts
caused by I/O transfers. Each Node Processor program must explicitly invoke
an I/O command before any data transfer.
PROCESSOR TRAPS 
There are a number of exceptional conditions which cause the Node Processor
to trap to fixed locations. These conditions may be due to hardware or
software failures. When a trap is executed the error condition will cause
the processor to push the present PSW and PC onto the system stack and take
the new PSW and PC from consecutive memory locations. The memory locations
and conditions are listed below. It is the software's responsibility to
load the proper addresses for the trap service routines.
Trap Condition                     Location 
Two-bit Memory Failure                 4 
Illegal Instruction                    6 
Out of Limits (CLIM)                   8 
FTR and (FOV1 or FOV2 or FDZ)         10 
Attempt SQRT with Negative            12 
Stack Underflow                       14 
Trace Enabled                         16 
FEN and BPT                           18 
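The trap sequence described above can be sketched as follows. The vector 
locations come from the table; the order in which PSW and PC are pushed, and 
the assumption that the new PSW sits at the vector location with the new PC 
in the following word, are guesses made for illustration only. 

```python
# Vector locations taken from the trap table above; the keys are
# hypothetical names for the listed conditions.
TRAP_VECTORS = {
    "two_bit_memory_failure": 4,
    "illegal_instruction": 6,
    "out_of_limits": 8,
    "fp_fault": 10,          # FTR and (FOV1 or FOV2 or FDZ)
    "sqrt_negative": 12,
    "stack_underflow": 14,
    "trace": 16,
    "breakpoint": 18,        # FEN and BPT
}


def take_trap(condition, psw, pc, memory, stack):
    """Push the present PSW and PC, then fetch the new pair.

    ASSUMPTION: PSW is pushed before PC, and the new PSW is stored at
    the vector location with the new PC in the next word.
    """
    location = TRAP_VECTORS[condition]
    stack.append(psw)
    stack.append(pc)
    return memory[location], memory[location + 1]
```

Software would fill `memory[location + 1]` with the address of each trap 
service routine before enabling the corresponding condition. 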
INSTRUCTION FORMATS 
The Node Processor instruction length is 64 bits. Each instruction contains 
an 8-bit opcode field which allows up to 256 distinct opcodes. Since fewer 
than 256 instructions have been defined, there are opcodes left for future 
instructions. Instructions may require zero or more operands for execution. 
For instructions with 2 or fewer operands, each operand is completely 
specified within the instruction word. A 4-bit address mode, a 4-bit register 
select and a 20-bit field are contained in the instruction to specify the 
effective address of the operand. The 20-bit quantity may represent an 
index, immediate operand, actual address, floating point register select, 
general register to autoincrement by or autoincrement value as required by 
the addressing mode. In the description of instruction formats below, this 
20-bit quantity is always labeled index though it may represent something 
else depending on the addressing mode. 
For instructions with 3 or more operands the first 2 operands will be 
specified as in the 2-operand instruction while the rest of the operands will 
be taken from pre-specified registers as defined by each instruction. The 
general format of the 64-bit instruction is: 
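 63      56 55  52 51  48 47  44 43  40 39          20 19           0 
+----------+------+------+------+------+--------------+--------------+ 
|  OPCODE  | FLD1 | FLD2 | FLD3 | FLD4 |     FLD5     |     FLD6     | 
+----------+------+------+------+------+--------------+--------------+ 
For no-operand instructions only the 8-bit opcode field is used. For 
1-operand instructions FLD1 specifies the operand addressing mode and FLD2 
specifies the general register to be used if required by the addressing mode. 
If the addressing mode requires an index, displacement, actual address or 
autoincrement value then it is contained in FLD5. The general format for a 
1-operand instruction is: 
 63      56 55  52 51  48 47  44 43  40 39          20 19           0 
+----------+------+------+------+------+--------------+--------------+ 
|  OPCODE  | MODE | REG  |      |      |    INDEX     |              | 
+----------+------+------+------+------+--------------+--------------+ 
For 2-operand instructions, the first operand is specified as in the 
1-operand instruction. The second operand uses FLD3, FLD4 and FLD6 to specify 
addressing mode, register select and index respectively. The format of a 
2-operand instruction is: 
 63      56 55  52 51  48 47  44 43  40 39          20 19           0 
+----------+------+------+------+------+--------------+--------------+ 
|  OPCODE  | MOD1 | REG1 | MOD2 | REG2 |    INDEX1    |    INDEX2    | 
+----------+------+------+------+------+--------------+--------------+ 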
The uses of the fixed fields in the 64-bit instruction are outlined below. 
Instruction 
Field     Bits     Contents 
OPCODE    63-56    Operation code. (0-255) 
FLD1      55-52    First address mode. (0-12) 
FLD2      51-48    Register select. 0-7 for R0 through R7. 8 for MS, 
                   9 for SP, 15 for PC. (Note: 10-14 not allowed. These 
                   registers are dedicated to the microcode.) 
FLD3      47-44    Second address mode. (0-12) 
FLD4      43-40    Second register select as in FLD2. 
FLD5      39-20    First address index value, immediate operand, actual 
                   address, register select (as in FLD2) with 
                   autoincrement value or F.P. register select (0-249). 
FLD6      19-0     Second address index value, immediate operand, actual 
                   address, register select with autoincrement value or 
                   F.P. register select (0-249). 
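The field layout above can be exercised with a small packing and unpacking 
sketch. Field names follow the 2-operand format; the opcode value used in 
any call is arbitrary, since the report does not assign numeric opcodes, and 
the function names are hypothetical. 

```python
# (name, lowest bit, width) for each field of the 64-bit instruction word,
# following the bit ranges 63-56, 55-52, 51-48, 47-44, 43-40, 39-20, 19-0.
FIELDS = (
    ("opcode", 56, 8),
    ("mode1", 52, 4),
    ("reg1", 48, 4),
    ("mode2", 44, 4),
    ("reg2", 40, 4),
    ("index1", 20, 20),
    ("index2", 0, 20),
)


def encode(**values):
    """Pack named field values into one 64-bit instruction word."""
    word = 0
    for name, shift, width in FIELDS:
        value = values.get(name, 0)  # unused fields default to zero
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def decode(word):
    """Split a 64-bit instruction word back into its named fields."""
    return {name: (word >> shift) & ((1 << width) - 1)
            for name, shift, width in FIELDS}
```

A 1-operand instruction would simply leave `mode2`, `reg2` and `index2` at 
zero, and a no-operand instruction would set only `opcode`. 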
NODE PROCESSOR INSTRUCTIONS 
OPCODE   OPERANDS      DESCRIPTION 
Integer Instructions: 
MOVE     SRC,DST       Move integers 
BLKM     ADR1,ADR2     Move block of integers 
ADD      SRC,DST       Add integers 
SUB      SRC,DST       Subtract integers 
MUL      SRC,DST       Multiply integers 
DIV      SRC,DST       Divide integers 
JSB      Rn,ADR        Jump to subroutine 
RSB      Rn            Return from subroutine 
PUSH     SRC           Push onto stack 
POP      SRC           Pop off stack 
SAVEM    Rn,CNT,ADR    Save multiple registers 
LOADM    ADR,Rn,CNT    Load multiple registers 
JMP      ADR           Unconditional jump 
JMPLE    ADR           Conditional jump if <= 
JMPLT    ADR           Conditional jump if < 
JMPGE    ADR           Conditional jump if >= 
JMPGT    ADR           Conditional jump if > 
JMPEQ    ADR           Conditional jump if = 
JMPNE    ADR           Conditional jump if <> 
SKPLE    SRC1,SRC2     Compare and skip if SRC1 <= SRC2 
SKPLT    SRC1,SRC2     Compare and skip if SRC1 < SRC2 
SKPGE    SRC1,SRC2     Compare and skip if SRC1 >= SRC2 
SKPGT    SRC1,SRC2     Compare and skip if SRC1 > SRC2 
SKPEQ    SRC1,SRC2     Compare and skip if SRC1 = SRC2 
SKPNE    SRC1,SRC2     Compare and skip if SRC1 <> SRC2 
TSTLE    SRC,ADR       Test and jump if <= 0 
TSTLT    SRC,ADR       Test and jump if < 0 
TSTGE    SRC,ADR       Test and jump if >= 0 
TSTGT    SRC,ADR       Test and jump if > 0 
TSTEQ    SRC,ADR       Test and jump if = 0 
TSTNE    SRC,ADR       Test and jump if <> 0 
BITNE    SRC1,SRC2     Test bit(s) and skip if not zero 
BITEQ    SRC1,SRC2     Test bit(s) and skip if zero 
SJGT     SRC,ADR       Subtract and jump if positive 
TRAP     ADR           Trap 
RTP                    Return from trap 
SPS      SRC           Set processor bits 
CPS      SRC           Clear processor bits 
RPS      DST           Read processor bits 
TPSNE    SRC           Read PSW and skip if selected bits are set 
TPSEQ    SRC           Test PSW bits and skip if 0 
SWAP     SRC,DST       Swap words 
ASHL     CNT,DST       Arithmetic shift left 
ASHR     CNT,DST       Arithmetic shift right 
LSHL     CNT,DST       Logical shift left 
LSHR     CNT,DST       Logical shift right 
AND      SRC,DST       Logical and 
IOR      SRC,DST       Inclusive or 
XOR      SRC,DST       Exclusive or 
COM      SRC,DST       Complement 
NEG      SRC,DST       Negate 
CLR      DST           Clear 
INC      DST           Increment 
DEC      DST           Decrement 
ADC      DST           Add carry 
ABS      SRC,DST       Absolute value 
CEA      SRC,DST       Compute effective address 
CLIM     LO,HI         Compare against limits 
CASE     SRC,HI        Case statement 
ISKP     SRC,DST       Increment and skip if within limit 
DSKP     SRC,DST       Decrement and skip if within limit 
Floating Point Instructions: 
FADD     Fm,Fn         Floating add, register to register 
FADD     Fn,DST        Floating add, register to memory 
FADD     SRC,Fn        Floating add, memory to register 
FADD     SRC,DST       Floating add, memory to memory 
FSUB     Fm,Fn         Floating subtract, register to register 
FSUB     Fn,DST        Floating subtract, register to memory 
FSUB     SRC,Fn        Floating subtract, memory to register 
FSUB     SRC,DST       Floating subtract, memory to memory 
FMUL     Fm,Fn         Floating multiply, register to register 
FMUL     Fn,DST        Floating multiply, register to memory 
FMUL     SRC,Fn        Floating multiply, memory to register 
FMUL     SRC,DST       Floating multiply, memory to memory 
FDIV     Fm,Fn         Floating divide, register to register 
FDIV     Fn,DST        Floating divide, register to memory 
FDIV     SRC,Fn        Floating divide, memory to register 
FDIV     SRC,DST       Floating divide, memory to memory 
FMOV     Fm,Fn         Floating move, register to register 
FMOV     Fn,DST        Floating move, register to memory 
FMOV     SRC,Fn        Floating move, memory to register 
FMOV     SRC,DST       Floating move, memory to memory 
FBLK     ADR1,ADR2     Floating block move 
FPUSH    SRC           Push floating 
FPOP     DST           Pop floating 
FSAVE    Fn,CNT,ADR    Save floating point registers Fn to F<n+CNT-1> 
FLOAD    ADR,Fn,CNT    Load floating point registers from memory 
FSKPLE   SRC1,SRC2     Compare floating and skip if SRC1 <= SRC2 
FSKPLT   SRC1,SRC2     Compare floating and skip if SRC1 < SRC2 
FSKPGE   SRC1,SRC2     Compare floating and skip if SRC1 >= SRC2 
FSKPGT   SRC1,SRC2     Compare floating and skip if SRC1 > SRC2 
FSKPEQ   SRC1,SRC2     Compare floating and skip if SRC1 = SRC2 
FSKPNE   SRC1,SRC2     Compare floating and skip if SRC1 <> SRC2 
FJMLE    ADR           Branch if FP multiply result <= 0 
FJMLT    ADR           Branch if FP multiply result < 0 
FJMGE    ADR           Branch if FP multiply result >= 0 
FJMGT    ADR           Branch if FP multiply result > 0 
FJMEQ    ADR           Branch if FP multiply result = 0 
FJMNE    ADR           Branch if FP multiply result <> 0 
FJLE     ADR           Branch if FP add, divide or subtract result <= 0 
FJLT     ADR           Branch if FP add, divide or subtract result < 0 
FJGE     ADR           Branch if FP add, divide or subtract result >= 0 
FJGT     ADR           Branch if FP add, divide or subtract result > 0 
FJEQ     ADR           Branch if FP add, divide or subtract result = 0 
FJNE     ADR           Branch if FP add, divide or subtract result <> 0 
FTSTLE   SRC,ADR       Branch if floating SRC <= 0 
FTSTLT   SRC,ADR       Branch if floating SRC < 0 
FTSTGE   SRC,ADR       Branch if floating SRC >= 0 
FTSTGT   SRC,ADR       Branch if floating SRC > 0 
FTSTEQ   SRC,ADR       Branch if floating SRC = 0 
FTSTNE   SRC,ADR       Branch if floating SRC <> 0 
FABS     SRC,DST       Absolute value of floating point 
High Level Instructions: 
SIN      SRC,DST       Sine 
COS      SRC,DST       Cosine 
ATAN     SRC,DST       Arc tangent 
SQRT     SRC,DST       Square root 
MTMUL    MAT1,MAT2     Matrix multiplication 
MTADD    MAT1,MAT2     Matrix addition 
MVMUL    MAT,VEC       Matrix times vector 
SOLVE    MAT,VEC       Gaussian elimination 
MSVEC    SRC,DST       Scalar times vector 
ASVEC    SRC,DST       Add scalar to vector 
VCMUL    SRC1,SRC2     Cross product 
VCADD    SRC1,SRC2     Add 2 vectors 
MTMUL2   MAT1,MAT2     Generalized matrix multiplication 
MTADD2   MAT1,MAT2     Generalized matrix addition 
MVMUL2   MAT,VEC       Generalized matrix times vector 
RKM1     SRC1,SRC2     Performs step 1 of the integration 
RKM2     SRC1,SRC2     Performs step 2 of the integration 
RKM3     SRC1,SRC2     Performs step 3 of the integration 
RKM4     SRC1,SRC2     Performs step 4 of the integration 
RKM5     SRC1,SRC2     Performs step 5 of the integration 
RKERR    SRC,DST       Computes the local truncation 
TABLE 1 
Addressing Mode Execution Times 
                    Integer   F.P.     Integer       F.P. 
Mode                Source    Source   Destination   Destination   ADR 
 0   Register        1         1        1             1 
 1                   4.5       6        6.5           8             1 
 2   Immediate       1         1 
 3   Memory          4.5       6        6.5           8             1 
 4                   4.5       6        7.5           9             2 
 5   Indirect        8         9.5     11            12.5           4.5 
 6                   5.5       7        7.5           9             2 
 7                   4.5       6        6.5           8             2 
 8   Indirect        8         9.5     12            13.5           5.5 
 9   Indirect        8         9.5     11            12.5           4.5 
10   Indirect        8         9.5     11            12.5           4.5 
11                   5.5       7        7.5           9             2 
12                   4.5       6.0      6.5           8             2 
The numbers are microcycles. If the microcycle involves a memory access, 
then it is 1.5 microcycles. Integer memory reference modes use 1 memory 
access; floating point involves 2. Indirect integer modes have 2 memory 
cycles; floating point indirect modes involve 3 memory accesses. 
TABLE 2 
Addressing Mode Time Estimates in μseconds 
Uses Table 1 microcycles * 150 nsec. per cycle 
                    Integer   F.P.     Integer   F.P.     Jump 
                    Source    Source   Dest.     Dest.    Address 
Mode                ISRC      FSRC     IDST      FDST     JADR 
 0   Register        .15       .15      .15       .15 
 1                   .675      .9       .975     1.2       .15 
 2   Immediate       .15       .15 
 3   Memory          .675      .9       .975     1.2       .15 
 4                   .675      .9      1.125     1.35      .3 
 5   Indirect       1.2       1.425    1.65      1.875     .675 
 6                   .825     1.05     1.125     1.35      .30 
 7                   .675      .9       .975     1.2       .30 
 8   Indirect       1.2       1.425    1.8       2.025     .825 
 9   Indirect       1.2       1.425    1.65      1.875     .675 
10   Indirect       1.2       1.425    1.65      1.875     .675 
11                   .825     1.05     1.125     1.35      .30 
12                   .675      .9       .975     1.2       .30 
Used 150 nsec for read cycle time; 225 nsec for memory read/write time. 
(Usually there are 2 cycles involved in a memory access; the 1st does the 
read, the 2nd checks for a 1 or 2 bit error. Overall time would be saved if 
the 2nd microcycle were short.) 
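The relationship between Table 1 and Table 2 is a straight multiplication by 
the 150 nsec microcycle. The sketch below reproduces Table 2 from the Table 1 
entries; the dictionary layout and function names are illustrative only, and 
`None` marks a combination for which the tables give no value. 

```python
MICROCYCLE_NS = 150  # average bit-slice microcycle, from the Table 2 heading

# Table 1 in microcycles: mode -> (ISRC, FSRC, IDST, FDST, JADR).
TABLE1 = {
    0: (1, 1, 1, 1, None),
    1: (4.5, 6, 6.5, 8, 1),
    2: (1, 1, None, None, None),
    3: (4.5, 6, 6.5, 8, 1),
    4: (4.5, 6, 7.5, 9, 2),
    5: (8, 9.5, 11, 12.5, 4.5),
    6: (5.5, 7, 7.5, 9, 2),
    7: (4.5, 6, 6.5, 8, 2),
    8: (8, 9.5, 12, 13.5, 5.5),
    9: (8, 9.5, 11, 12.5, 4.5),
    10: (8, 9.5, 11, 12.5, 4.5),
    11: (5.5, 7, 7.5, 9, 2),
    12: (4.5, 6, 6.5, 8, 2),
}


def to_microseconds(cycles):
    """Convert a microcycle count to microseconds (None stays None)."""
    if cycles is None:
        return None
    return cycles * MICROCYCLE_NS / 1000


def table2():
    """Rebuild Table 2 (microseconds) from Table 1 (microcycles)."""
    return {mode: tuple(to_microseconds(c) for c in row)
            for mode, row in TABLE1.items()}
```

For example, mode 8 gives 1.2, 1.425, 1.8, 2.025 and .825 μseconds, matching 
the corresponding row of Table 2. 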
TABLE 3 
Example Execution Times for Instructions 
(NOTE: Each instruction is without the fetch cycle, 
which adds 4 microcycles (.6 μsec.)) 
Instruction              Times in microcycles 
MOVE                     2 + ISRC + IDST 
ADD                      2 + ISRC + IDST 
JMP                      1 + JADR 
JSB                      5 + JADR 
FADD            R/R      6 + Add time 
                R/M      10 + Add time + FDST 
                M/R      6 + Add time + FSRC 
                M/M      6 + Add time + FSRC + FDST 
FMUL            R/R      6 + Mult. time 
                R/M      10 + Mult. time + FDST 
                M/R      6 + Mult. time + FSRC 
                M/M      6 + Mult. time + FSRC + FDST 
FMOV            R/R      6 
                R/M      10 + FDST 
                M/R      6 + FSRC 
                M/M      6 + FDST + FSRC 
SINE (typical)           73 + 14 adds + 11 multiplies 
ATAN (typical)           76 + 10 adds + 6 multiplies + 2 divides 
MTMUL                    36 + R*(3 + C*(2 + add time)) 
(Matrix multiply)        (add time overlaps multiply) 
SOLVE                    on order of 2N*DT + N^2*MT + N^3/2 
(Gaussian Elimination)   where DT = divide time 
                               ST = subtract time 
                               MT = multiply time 
                               N  = # of rows, # of columns in matrix 
RKM1                     9 + N*(4 + MT) + FSRC + FDST 
TABLE 4 
The following are the expected execution times for various Node Processor 
instructions. To simplify the tables and give a best guess sampling of 
instruction times, the following assumptions have been made. 
a. For operands labeled REGISTER, the operand may come from a 
register or be immediate (i.e., part of the instruction). 
b. Operands labeled MEMORY are either mode #1 or #3. (Probably 
the most typical.) 
c. Average microcycle time for the bit slice is 150 nsec.; a microcycle 
with memory access is 1.5 microcycles. See Table 2 for 
explanation. 
d. Floating point operation times used (μsec.): 
1.0 Multiply 
1.5 Add or Subtract 
15.4 Divide 
These numbers would probably be smaller with sparsely 
filled matrices since operations with one zero operand are 
significantly faster. 
e. Times do not include the instruction fetch times, which will 
be approximately .6 microseconds per instruction. 
Instructions with comparable execution times 
MOVE, ADD, SUB, AND, IOR, XOR, COM, NEG, INC, DEC, ADC, ABS, PUSH, POP 
JMP, JMPLE, JMPLT, JMPGE, JMPGT, JMPEQ, JMPNE 
SKPLE, SKPLT, SKPGE, SKPGT, SKPNE, SKPEQ, BITNE, BITEQ 
TSTLE, TSTLT, TSTGE, TSTGT, TSTEQ, TSTNE 
ASHL, ASHR, LSHL, LSHR 
FADD, FSUB 
FMOV, FPUSH, FPOP 
FSKPLE, LT, GT, GE, NE, EQ 
FTSTLE, LT, GT, GE, NE, EQ 
FJMLE, LT, GE, GT, NE, EQ, FJLE, LT, GE, GT, NE, EQ 
                              Register to   Register to   Memory to   Memory to 
Function                      Register      Memory        Register    Memory 
MOVE                            .6           1.425         1.125       1.95 
ADD                             .6           1.425         1.125       1.95 
JMP                             .3 
JSB                             .9 
FADD                           2.4           4.2           3.3         4.5 
FMUL                           1.9           3.7           2.8         4.0 
FMOV                            .9           2.7           1.8         3.0 
SINE                          42.95         44.15         43.85       45.05 
ATAN                          63.2          64.4          64.1        65.3 
MTMUL (10x10)                189.9 
      (50x50)                  4.53 ms 
SOLVE (10x10)                  1.3 ms 
      (50x50)                101.6 ms 
RKM1  (10) state variables    17.35 
      (50) state variables    81.35 
* Time in μseconds except as noted 
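Several Table 4 entries can be reproduced from the Table 3 expressions 
together with assumptions (b) through (d). The sketch below does that 
arithmetic; the function names are hypothetical, and the observation that 
the tabulated RKM1 values correspond to the formula without its FSRC and 
FDST terms is an inference from the numbers, not a statement in the report. 

```python
CYCLE_US = 0.15                  # 150 nsec microcycle
ADD_US, MUL_US, DIV_US = 1.5, 1.0, 15.4   # assumption (d)
FSRC_US, FDST_US = 0.9, 1.2      # memory mode #1 or #3, from Table 2


def fp_two_operand(op_us):
    """R/R, R/M, M/R, M/M times for FADD, FMUL or FMOV (op_us = 0)."""
    return {
        "R/R": 6 * CYCLE_US + op_us,
        "R/M": 10 * CYCLE_US + op_us + FDST_US,
        "M/R": 6 * CYCLE_US + op_us + FSRC_US,
        "M/M": 6 * CYCLE_US + op_us + FSRC_US + FDST_US,
    }


def sine_rr():
    return 73 * CYCLE_US + 14 * ADD_US + 11 * MUL_US


def atan_rr():
    return 76 * CYCLE_US + 10 * ADD_US + 6 * MUL_US + 2 * DIV_US


def mtmul(rows, cols):
    # Only the add time appears because the add overlaps the multiply.
    return 36 * CYCLE_US + rows * (3 * CYCLE_US + cols * (2 * CYCLE_US + ADD_US))


def rkm1(n):
    # Operand access terms (FSRC, FDST) left out; see the note above.
    return 9 * CYCLE_US + n * (4 * CYCLE_US + MUL_US)
```

With these constants `mtmul(50, 50)` comes to about 4528 μseconds, which 
rounds to the 4.53 ms shown in the table. 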
SOFTWARE SPECIFICATION SUMMARY 
The software identified for the Digital System for Structural Dynamics 
Simulation includes the following segments. 
1. Offline Model Development Software. 
2. Realtime Execution Control Software. 
3. Node Processor Operational Programs. 
4. Node Processor Instruction Set. 
Offline Model Development Software 
The Offline Model Development Software includes the linker/assembler programs 
for generating Node Processor executable programs and a compiler to translate 
user specifications of simulation problems into a form that can be handled 
by the Node Processor operational programs. 
The assembler and linker programs are necessary to generate programs for the 
Node Processor. These programs are needed for any newly designed computer 
system. 
The Model Specification language compiler is designed to quicken the task 
of preparing for simulation runs. The language must first be designed before 
the compiler (translator) can be fully specified. Before the language is 
designed, a sample problem should be identified. The purpose of the sample 
problem is to provide a focal point during the development of the language. 
It will provide useful guidelines for the language and thus reduce the 
probability of a costly over-generalized language. 
The translator should not be considered as a full blown compiler for languages 
such as Fortran, Basic and ADA. The language in all likelihood would be 
rigidly formatted and limited in scope so as to ease the tasks required by 
a compiler. The compiler would need to parse arithmetic expressions and 
generate the appropriate code for the Node Processor. Elaborate capabilities 
would probably not be cost effective. 
Realtime Execution Control Software 
The Realtime Execution Control Software includes a debugging facility and 
the Execution Control program for running simulations. The debugging 
facility is essential for developing the operational programs. The Node 
Processor has been designed to facilitate the operation of a debugger. 
The instruction trace enable bit in the Processor Status Word and the 
hardware memory address breakpoint comparator provide the necessary "hooks" 
for a debugger. 
The Execution Control Program is the software utility for controlling 
simulation runs. It must be capable of handling the network in realtime. A 
simplified communications protocol and hardware design are essential to 
keep the complexity of the program to a minimum. Polled operation was 
selected over interrupt or asynchronous operation for the communications 
because of the need for a simple hardware and software design. 
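To make the polled approach concrete, the sketch below shows a host loop 
that repeatedly reads a "ready" indication from every node until all nodes 
report that the current step is done. The `Node` class is a stand-in for 
real hardware and every name in it is hypothetical; the report does not 
define the host-side interface. 

```python
class Node:
    """Stand-in for a Node Processor that becomes ready after some polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    def ready(self):
        # Models reading the node's RDY indication once.
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True


def wait_all_ready(nodes, max_polls=1000):
    """Poll every node until all are ready; return the number of polls."""
    for poll in range(1, max_polls + 1):
        states = [node.ready() for node in nodes]  # poll each node once
        if all(states):
            return poll
    raise TimeoutError("nodes did not become ready")
```

No interrupt handlers or asynchronous state are involved: the host simply 
advances to the next simulation step once `wait_all_ready` returns. 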
Node Processor Operational Program 
The Node Processor Operational Programs are needed to execute the substructure 
simulations within each processor. Full software specifications for the 
operational programs are not possible until a sample problem is selected 
and the Model Specification Language is defined. The need for the operational 
programs is clear. Different sets of programs are needed for the various 
models the Digital System is to handle. 
Node Processor Instruction Set 
The Node Processor Instruction Set was designed as a comprehensive instruction 
set to accommodate the requirements of the simulation problems. The 
identified needs of the instruction set included: 
1. General purpose instructions for flexibility in programming 
the Node Processor. 
2. Matrix and vector operations for solving the state equations. 
3. High speed floating point operations on double precision 
operands. 
4. Comprehensive addressing modes for handling the data structures 
required by the simulation programs. 
5. Microcoded parallel operation during lengthy calculations 
and I/O with nearest neighbors. 
The instruction set designed has incorporated the needs identified above. 
It includes a full complement of instructions typically found on the latest 
generation of minicomputers. The many addressing modes and opcodes 
specified provide over 10,000 distinct operations. The vast capability did not 
slow the expected execution speeds of instructions since the hardware 
incorporates the latest techniques in pipelined architecture. In addition, 
the proper selection of instruction decoding and execution techniques 
enabled the extensive use of microcode subroutining, thereby keeping the 
size of the operational microcode within reason. 
The higher level instructions specified for the Node Processor all take 
advantage of parallel operations where possible. Software pipelining [2] 
techniques were used in instructions which do matrix multiplication, 
Gaussian elimination, sine/cosine calculation and Runge-Kutta-Merson 
integration. The use of software pipelining was possible because of the 
autonomous floating point multiplier and adder units. 
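For readers unfamiliar with Runge-Kutta-Merson integration, the sketch below 
gives one step of the standard textbook form of the method: five derivative 
evaluations followed by a local truncation error estimate. Associating the 
five stages with RKM1 through RKM5 and the estimate with RKERR is an 
assumption; the report does not spell out what each instruction computes, 
and the function names here are hypothetical. 

```python
import math


def merson_step(f, t, y, h):
    """One Runge-Kutta-Merson step; returns (new_state, error_estimate)."""

    def combine(*terms):
        # y + sum(coefficient * k) evaluated component by component.
        return [yi + sum(c * k[i] for c, k in terms) for i, yi in enumerate(y)]

    k1 = [h * v for v in f(t, y)]
    k2 = [h * v for v in f(t + h / 3, combine((1 / 3, k1)))]
    k3 = [h * v for v in f(t + h / 3, combine((1 / 6, k1), (1 / 6, k2)))]
    k4 = [h * v for v in f(t + h / 2, combine((1 / 8, k1), (3 / 8, k3)))]
    k5 = [h * v for v in f(t + h, combine((1 / 2, k1), (-3 / 2, k3), (2, k4)))]
    y_new = combine((1 / 6, k1), (2 / 3, k4), (1 / 6, k5))
    # Merson's local truncation error estimate.
    err = [(2 * a - 9 * c + 8 * d - e) / 30
           for a, c, d, e in zip(k1, k3, k4, k5)]
    return y_new, err


def integrate(f, y0, h, steps):
    """Advance the state y0 through a fixed number of steps of size h."""
    t, y = 0.0, list(y0)
    for _ in range(steps):
        y, _ = merson_step(f, t, y, h)
        t += h
    return y
```

Each stage is a scaled vector update of the state, which is the kind of 
multiply-then-add work that two independent floating point units can 
overlap. 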
In addition to using software pipelining wherever applicable, the floating 
point units were optimized for the operations expected during simulations. 
Floating point calculations are done on mantissas with 7 extra bits of 
precision so as to reduce round off error during intermediate calculations. 
The exponent range is the same in the floating point units as in main memory. 
To speed up calculations on large matrices which contain many zero valued 
elements the floating point units detect zero operands on input and return 
their result immediately. When either operand is zero, the floating point 
result is returned up to 70% faster. 
SOFTWARE ASSESSMENT 
A very powerful custom-made processor instruction set was designed and 
flowcharted. These instructions range from basic moves and integer 
operations to complex floating point operations with matrices. Since the 
node processor is microprogrammable, the suggested instruction set is not 
cast in concrete. The instructions were designed with simulation in mind 
and will make the solution of simulation models as efficient as possible. 
Software routines for all levels of the system, however, are presently 
only in the conceptual stage. Offline model development software, realtime 
execution control software, and node processor operation programs 
remain to be designed. Each section of the software is in itself a 
substantial task. 
The offline model development software will be made up of a compiler for 
the model specification language, an assembler and a linker for the node 
processor instruction set. It is not likely that an existing language 
will satisfy the need for the model specification language, so it must also 
be designed. The compiler for this language will translate the user model 
into node processor assembly language while segmenting the problem equally 
among the node processors. The assembler and linker would be fairly standard 
and similar to many available for minicomputers. 
The realtime model execution software will consist of an execution control 
program which controls loading of the node processors and the synchronization 
of the processor array during run time. It also will collect data from the 
processors when appropriate. A node processor program debugger will be a 
part of this software to aid in the development of operational programs. 
The node processor operational programs will be the true simulation programs. 
These programs will be dependent on the simulation model. 
The node processor microcode, while flowcharted, remains to be written. 
These routines, sometimes called firmware, will implement the node processor 
instruction set. 
Above and beyond all of the simulation programs, it will be necessary to 
have diagnostic software to ensure the proper operation of the control 
computer with the node processor array. Lower level programs should be 
capable of identifying board level faults within a node processor. 
The above software assemblage will provide a functionally complete and 
convenient package for structural dynamics simulation. 
NODE PROCESSOR HARDWARE 
The Node Processor hardware consists of seven major blocks of hardware: 
1. The microprogram controller has the function of decoding 
program instructions and controlling the proper operation 
of the remaining six sections of hardware. 
2. The registered arithmetic logic units (RALU's) perform the 
integer arithmetic and logic functions of the node processor. 
The RALU's are the BIT-SLICES. 
3. The dynamic memory is the main storage area for local program 
and data storage within the node processor. 
4. The node processor communications hardware provides the I/O 
ports necessary for the six nearest neighbor communications 
and communications with the central minicomputer. 
5. The floating point bus interface and scratch pad is used 
between the CPU of the node processor and the floating point 
units to buffer, hold intermediate floating point values, 
and expand/truncate floating point values. 
6. The floating point multiplier is used to perform all floating 
point multiplies within the node processor. 
7. The floating point adder/subtractor/divider performs all 
floating point addition, subtraction, and division within 
a node processor. 
The relationship of the above hardware is shown in Figure 6. 
Each node processor has its CPU implemented with microprogrammed bit-slice 
hardware. Bit-slice hardware is currently available in ECL or TTL 
technology. TTL technology was chosen because of difficulties in designing 
with ECL. ECL consumes a great deal of power, the variety of circuits 
available is limited, and second sourcing of parts is a problem. 
[Figure 6 (block diagram): CPU registered arithmetic logic units, 
microprogram controller with microcode PROM, dynamic RAM memory, floating 
point bus interface, floating point multiplier, floating point 
adder/subtractor/divider, six-way nearest neighbor communications interface, 
and global bus.] 
Figure 6 NASA/Lewis Simplified Node Processor Block Diagram 
In TTL technology, Advanced Micro Devices (AMD) is the leader in bit-slice 
hardware. Other manufacturers second source many of AMD's products. 
Standard TTL, low power Schottky, Schottky and new advanced Schottky 
are all compatible with the bit slice hardware. The most flexible, low 
power, high speed design is possible with TTL technology. 
MICROPROGRAM CONTROLLER 
The microprogram controller is the section of the node processor CPU which 
selects a coherent sequence of microinstructions used to execute the various 
instructions required by the processor. Each elemental task performed by 
the processor is called a microinstruction. A single machine instruction 
will take one or more microinstructions to execute. These microinstructions 
are stored in a permanent memory called microcode PROM. A sequence of 
microinstructions initiated by a machine instruction is called a 
microroutine. Because there is a great deal of functional overlap, many machine 
instructions will execute microroutines that share portions of the microcode. 
The node processor microprogram controller consists of the following 
hardware: the Instruction Register, the Instruction Mapping PROM, the Address 
Mode Mapping PROM, an Address MUX, the Microprogram Sequencer, the Condition 
Code Select MUX, the Microprogram Memory, and the Pipeline Register. Also 
shown on the block diagram of Figure 7 is the Clock Generator. 
Instruction Register 
The Instruction Register is a 64-bit edge triggered register which holds 
the next machine instruction to be executed. It takes two microcycles to 
fetch the instruction from the dynamic memory and load it in the IR. Each 
microcycle may only fetch 32 bits. Since the instruction is so wide, it 
contains 7 different fields capable of implementing a very complete 
instruction set. 
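INSTRUCTION REGISTER 
+--------+-----------+-----------+-----------+-----------+---------+---------+ 
|   OP   |   MODE    | REGISTER  |   MODE    | REGISTER  |  INDEX  |  INDEX  | 
|  CODE  | SELECT #1 | SELECT #1 | SELECT #2 | SELECT #2 |   #1    |   #2    | 
| 8 BITS |  4 BITS   |  4 BITS   |  4 BITS   |  4 BITS   | 20 BITS | 20 BITS | 
+--------+-----------+-----------+-----------+-----------+---------+---------+ 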
The 8-bit op code allows for up to 256 different op codes. Two registers 
and their addressing modes may be selected. Two separate index values may 
be specified. Of course, not all instructions will use all fields and the 
format may vary slightly among instructions. 
[Figure 7 (block diagram): 32-bit memory data path into the 64-bit 
Instruction Register (op code, mode #1, reg. #1, mode #2, reg. #2, index #1, 
index #2); Instruction Mapping PROM; Address Mode Mapping PROM; Address MUX 
(with MDR11-MDR0 input); Am2910 Microprogram Sequencer; Condition Code 
Select MUX (PSW and other status inputs); Microprogram Memory (Am27S185A); 
Pipeline Register; crystal-driven Clock Generator.] 
Figure 7 NASA/Lewis SDS Node Processor CPU Microprogram Controller 
Instruction Mapping PROM 
The Instruction Mapping PROM is a code converter which converts the 
eight-bit Op Code into a 12-bit microaddress. It is a 256 x 12 bit wide area 
of PROM. The 12-bit microaddress is typically the start of the 
microroutine for the given machine instruction. 
Address Mode Mapping PROM 
The Address Mode Mapping PROM is a code converter which takes 4 bits from 
the Branch Address field of the Pipeline Register and 4 bits from one of 
the two address mode select fields of the instruction register and converts 
them into a 12-bit microaddress. Only a few of the possible 256 different 
addressing modes are actually implemented. 
Address MUX 
The Address MUX selects one of four different branch addresses to the 
Am 2910 Microprogram Controller. The four choices are the Instruction 
Mapping PROM, the Branch Address field of the Pipeline Register, the Address 
Mode Mapping PROM, or the least significant 12 bits of the Memory Data 
Register (MDR11-MDR0). 
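The four-way selection performed by the Address MUX, together with the two 
mapping PROMs that feed it, can be modeled as below. Which 4 bits of the 
Branch Address field are used and how they are combined with the mode field 
to index the Address Mode Mapping PROM are not given in the report; the low 
4 bits and the ordering used here are assumptions, as are all of the names. 

```python
from enum import Enum


class Source(Enum):
    INSTRUCTION_MAP = 0    # Instruction Mapping PROM output
    BRANCH_FIELD = 1       # Branch Address field of the Pipeline Register
    ADDRESS_MODE_MAP = 2   # Address Mode Mapping PROM output
    MDR_LOW = 3            # MDR11-MDR0


def next_microaddress(select, opcode, branch_field, mode, mdr,
                      instr_prom, mode_prom):
    """Return the 12-bit branch address presented to the sequencer."""
    if select is Source.INSTRUCTION_MAP:
        return instr_prom[opcode & 0xFF] & 0xFFF   # 256 x 12 lookup
    if select is Source.BRANCH_FIELD:
        return branch_field & 0xFFF
    if select is Source.ADDRESS_MODE_MAP:
        # ASSUMPTION: low 4 bits of the branch field form the upper half
        # of the PROM address, the 4-bit mode select the lower half.
        key = ((branch_field & 0xF) << 4) | (mode & 0xF)
        return mode_prom[key] & 0xFFF
    return mdr & 0xFFF                              # least significant 12 bits
```

Modeling the PROMs as 256-entry lists makes it easy to see why only a few 
of the 256 possible address mode entries need to hold meaningful 
microaddresses. 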
Microprogram Controller 
The Am 2910 Microprogram Controller [3] is an address sequencer intended for 
controlling the sequence of execution of microinstructions stored in 
microprogram memory. Besides the capability of sequential access, it 
provides conditional branching to any microinstruction within its 
4096-microword range. A last in, first out stack provides microsubroutine 
return linkage and looping capability, allowing five levels of nesting 
subroutines. 
Condition Code Select MUX 
The Condition Code Select MUX selects the branch condition to the CC input 
of the Am 2910 Microprogram Controller. The conditions are: 
1. Floating Point Multiplier 
a. ZERO 
b. CARRY 
c. NEGATIVE 
d. OVERFLOW 
2. Floating Point Adder 
a. ZERO 
b. CARRY 
c. NEGATIVE 
d. OVERFLOW 
3. Integer CPU 
a. ZERO 
b. CARRY 
c. NEGATIVE 
d. OVERFLOW 
4. (Floating Point Trap Enable) AND (Floating Point Multiplier 
Overflow) 
5. (Floating Point Trap Enable) AND (Floating Point Adder Overflow 
OR Divide by Zero) 
6. Trace bit set in Processor Status Word 
7. Memory Error 
a. One Bit Error 
b. Two Bit Error 
8. Floating Point Multiplier DONE 
9. Floating Point Adder/Subtractor/Divider DONE 
10. ALL READY on Global Bus 
11. Data Received on input latch 
12. Memory Write Fault (Memory Address Register ≤ Address Fence) 
on write operation 
13. MEMORY TRAP: (One Bit Error) OR (Two Bit Error) OR (Write AND 
Memory Address Register ≤ Address Fence) 
Microprogram Memory 
The Microprogram Memory contains all of the microroutines which form the 
instruction set of the node processor. Since the microcode word for the 
node processor has not been completely defined, nor has the microcode been 
written, the width and depth of the PROM area are not specified. Due to 
hardware development constraints the microcode depth is limited to 4K words 
and it is preferable to keep the width at 64 bits or fewer. 
Two possible choices for the PROM chips are the Am 27S185A (2K x 8) or 
the Intel 3632 (4K x 8). 
Pipeline Register 
The Pipeline Register allows an overlap of the fetch of the next 
microinstruction while the current microinstruction is being executed. The 
next instruction is being decoded while the microinstruction latched in 
the pipeline register is being executed. The position of the pipeline 
register immediately after the microprogram PROM causes this arrangement 
to be called the instruction - data based architecture. 
Clock Generator 
The Clock Generator is the Am 2925 System Clock Generator and Driver. This 
integrated circuit is programmed by the microcode in order to vary the 
microcycle length. It also contains the HALT/RUN and SINGLE STEP logic 
used for system debugging. Up to 8 different cycle lengths of the four 
phase output may be generated. 
REGISTERED ARITHMETIC AND LOGIC UNITS (RALU's) 
The RALU's are the hardware where the integer arithmetic and the logic 
operations of the node processor are performed. The RALU's have associated 
with them Register Select Input Multiplexers, Direct Input Select 
Multiplexers, Shift Control Logic, Memory Address Register, Memory Data Register, 
Memory Data Bidirectional Buffer, Processor Status Word Multiplexer, and 
Processor Status Word Register. See Figure 8. 
The above hardware is responsible for the generation of the next 
macroaddress and, together with the Microprogram Controller, forms the node 
processor CPU. 
RALU's 
The RALU's are the Am 2903 four bit bipolar microprocessor slices. There 
are eight of these bit slices for a 32-bit CPU. The RALU's perform all 
of the integer arithmetic, such as memory address calculation, and all of 
the logical functions needed by the CPU. The RALU's contain 16 internal dual 
port registers, various internal latches, and shifters. 
REGISTER SELECT INPUT MULTIPLEXERS 
The Register A Select Input Multiplexer selects one of five different groups 
of 4 bits to be the A Register address. The five bit fields are the REG #1 
instruction field, the least significant 4 bits of the INDEX #1 or INDEX #2 
instruction field, the A SEL field of the Pipeline Register, or the 
MODE 2 field. 
The Register B Select Input Multiplexer selects one of four different groups 
of 4 bits to be the B Register address. The four bit fields are the REG #1 
or REG #2 instruction fields, the four least significant bits of the Memory 
Data Register, or the B SEL Pipeline Register field. 
DIRECT INPUT SELECT MULTIPLEXERS 
The Direct A Input Select Multiplexer selects one of four different sets of 
inputs for the Am 2903 DA inputs. These fields are the 8 bit field formed by 
the Index #1 or Index #2 Instruction Register fields, an 8 bit latched 
constant, or the Processor Status Word. 
[Figure 8 (block diagram): Register A and Register B Select Input 
Multiplexers, Direct A and Direct B Input Select Multiplexers and logic, 
Shift Control, the 2903 RALU array, Memory Address Register (MAR) driving 
the 20-bit memory address, Memory Data Register (MDR), Bidirectional Buffer 
to the 32-bit Memory Data Bus, Internal Data Bus, connection to the Floating 
Point Bus Interface, Processor Status Word Multiplexer, and Processor Status 
Word Register (PSW).] 
Figure 8 NASA/Lewis SDS Node Processor CPU Registered ALU's Block Diagram 
Shift Control Logic
The Shift Control Logic determines the type of shift done by the Am 2903.
A one or a zero may be selected as the shift input, or a rotate may be
selected.
Memory Address Register (MAR)
The MAR is a 20-bit wide positive edge triggered register used to hold
the current dynamic memory address. This register also buffers the
address onto the Address lines.
MEMORY DATA REGISTER (MDR)
The MDR is a 32-bit wide positive edge triggered register used to hold
values to be placed on the MEMORY DATA BUS.
MEMORY DATA BIDIRECTIONAL BUFFER
This buffer buffers the 32-bit wide output of the MDR to the MEMORY DATA
BUS and it buffers the data from the MEMORY DATA BUS to the INTERNAL DATA
BUS.
PROCESSOR STATUS WORD MULTIPLEXER
The PSW multiplexer selects between the Internal Data Bus and the various
Processor Status bits as inputs to the PSW register.
PROCESSOR STATUS WORD REGISTER (PSW)
The PSW holds information concerning the current status of the node processor.
The PSW bits are defined as follows:
PROCESSOR STATUS WORD

Bit   Mnemonic   Contents
0     CR         Integer carry
1     OV         Integer overflow
2     NE         Integer result negative
3     ZE         Integer result zero
4     FUN1       FP multiply underflow
5     FOV1       FP multiply overflow
6     FNE1       FP multiply negative
7     FZE1       FP multiply result zero
8     FUN2       FP add/subtract/divide underflow
9     FOV2       FP add/subtract/divide overflow
10    FNE2       FP add/subtract/divide result negative
11    FZE2       FP add/subtract/divide result zero
12    FDZ        FP divide by zero
13    FTR        Trap enable on FDZ, FOV1, or FOV2 via location 10
14    TRC        Trace enable. Trap after each instruction.
15    PAR        One bit error from memory access
16    RER        Two bit error from memory access
17    RDY        Write only. Signals to host this processor is done
                 with current step.
18    BPT        Enable trap on setting of FEN
19    FEN        Memory write address equals MACR
20    FLT        Memory write address less than MACR
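
The PSW layout can be sketched as a small decoder. This is a minimal illustration, not part of the design: the function names are hypothetical, bit 0 is assumed to be the least significant bit of the word, bit 7 is assumed to be FZE1 (FP multiply result zero), and the trap condition simply restates the FTR entry of the table.

```python
# Mnemonics indexed by PSW bit number (bit 0 assumed to be the LSB).
PSW_BITS = [
    "CR", "OV", "NE", "ZE",
    "FUN1", "FOV1", "FNE1", "FZE1",   # bit 7 assumed to be FZE1
    "FUN2", "FOV2", "FNE2", "FZE2",
    "FDZ", "FTR", "TRC", "PAR", "RER", "RDY", "BPT", "FEN", "FLT",
]

def decode_psw(word):
    """Return the mnemonics of all PSW bits that are set in word."""
    return [name for bit, name in enumerate(PSW_BITS) if (word >> bit) & 1]

def fp_trap_pending(word):
    """True if FTR is set together with FDZ, FOV1, or FOV2 (assumed reading of the table)."""
    flags = set(decode_psw(word))
    return "FTR" in flags and bool(flags & {"FDZ", "FOV1", "FOV2"})
```

For example, `decode_psw(0b1001)` lists the integer carry and integer zero flags.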
DYNAMIC MEMORY
Each dynamic memory board contains 256K x 32 bits of memory (TMS 4164),
a dynamic memory controller, memory and refresh timing, and a section for
error detection and correction as shown in Figure 9.
A maximum of four (4) boards may be addressed by each node processor. A
node with four memory boards would have 1 megaword (32-bit words). Except
for CPU registers and floating point registers, all of the program and
data values used by the node processor are stored in the dynamic RAM.
Each board contains 156 64K dynamic RAMs. The memory array has 128
integrated circuits while the error detection and correction section uses the
remaining 28.
The Am2964B Dynamic Memory Controller is used to provide all address handling,
as well as RAS and CAS decoding and control. The device has 18 input latches
for capturing an 18-bit address for memory control. The two highest order
address bits are used to select one of four 64K x 32 bit blocks of RAM. The
Am2964B also contains an 8-bit refresh counter used to provide the necessary
256 line refresh mode. The CAS output is inhibited during refresh.
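
The chip counts and the address split described above can be checked with a short sketch. The field assignments are assumptions for illustration only: the text does not say which MAR bits select the board or which half of the 16-bit address is the row, so the choices below (top two MAR bits for the board, low byte for the row) are hypothetical.

```python
BANKS_PER_BOARD = 4              # four 64K x 32 bit blocks per board
WORDS_PER_BANK = 64 * 1024       # 64K x 1 RAMs
DATA_BITS = 32
CHECK_BITS = 7

WORDS_PER_BOARD = BANKS_PER_BOARD * WORDS_PER_BANK   # 256K words
DATA_CHIPS = BANKS_PER_BOARD * DATA_BITS             # memory array chips
CHECK_CHIPS = BANKS_PER_BOARD * CHECK_BITS           # EDC section chips

def split_address(addr20):
    """Split a 20-bit MAR value into (board, bank, row, column)."""
    board = (addr20 >> 18) & 0x3    # assumed: top two MAR bits select the board
    bank = (addr20 >> 16) & 0x3     # two highest order bits of the 18-bit address
    column = (addr20 >> 8) & 0xFF   # assumed: high byte is the column address
    row = addr20 & 0xFF             # assumed: low byte is the row address
    return board, bank, row, column
```

With these numbers the array needs 128 data chips and 28 check-bit chips, 156 in total, and four boards give 2^20 words.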
Normal operation of the Dynamic Memory Controller is to provide the address,
close the address latches and start off a normal memory cycle. This is
accomplished by bringing the RASI input LOW, which will cause one of the
RAS outputs to go low. After the required memory timing, the MSEL input
is used to switch the multiplexer to the CAS latch. Then the CASI input
will be driven LOW and execute the CAS part of the memory cycle.
The refresh cycle is executed by driving the RFSH signal low, which causes
all four RAS outputs to go low. This will simultaneously refresh all four
banks of memory controlled by the Dynamic Memory Controller. When either
the RFSH or RASI input is brought high, the refresh counter is advanced
so it will be ready for the next cycle.
Figure 9 NASA/Lewis Node Processor CPU Dynamic Memory 256K WD x 39 Bit 
Data interface among the dynamic memories, the Am2960 EDC circuit, and the
node processor data bus is accomplished by means of the Am2961 bus buffers.
Each 2961 contains two internal latches, a multiplexer, and a RAM driver
output buffer. Each 2961 is 4 bits wide so 8 are used in this 32-bit system.
The bus input latch of the 2961 is used for data storage during a memory
WRITE. The bus output latch in the 2961 is used predominantly for storing
the output data if the processor is in single step mode. In the single
step mode it is necessary to hold the output data on the system data bus
but the memory must be free to be refreshed.
A pair of Am2960 Error Detection and Correction units (EDC's) contain all
the necessary logic to generate check bits on a 32-bit data field according
to a modified Hamming code and to correct the data when the check bits are
supplied. Operating on the data read from memory, the EDC's can correct
all single bit errors and will detect all double and some triple bit errors.
For 32-bit words, 7 check bits are used.
Some additional circuitry is required to provide proper memory access
sequencing and timing for memory refresh. Memory timing is necessary because
the 16 bits of address must be multiplexed onto the dynamic RAMs' 8 address
lines. The memory must also be allowed to be refreshed, with an eight-bit
address, when it is not being accessed. When the CPU is
running, refresh is automatic and transparent within the microcode sequences.
When the CPU is halted, such as during single step mode, a special refresh
counter periodically refreshes the RAM.
Aside from the two refresh modes described above, the memory normally has
three operating modes.
In the write mode, a 32-bit value is loaded into the data input latch.
The EDCs generate the 7 check bits which correspond to the 32-bit value.
At the end of the write cycle, the data
and the check bits are written into the proper RAM location.
In the Detect mode the EDCs examine the contents of the Data Input Latch
(from the RAM) against the Check Bit Input Latch, and will detect all
single-bit errors, all double-bit errors, and some triple-bit errors. If
one or more errors are detected, the ERR status line to the CPU is pulled
low. If two or more errors are detected, MERR is pulled low. Both ERR and
MERR are open collector signals that remain high if there are no errors.
In the Detect mode, the contents of the Data Input Latch are driven directly
to the Data Output Latch without correction.
In the Correct mode, the EDCs function the same as in the Detect mode
except that the correction network is allowed to correct (complement) any
single bit error of the Data Input Latch before putting it into the inputs
of the Data Output Latch. If multiple errors are detected, the output of
the correction network is unspecified, and both the ERR and MERR lines are
pulled low. If the single-bit error is a check bit, there is no automatic
correction; if desired, this would be done by placing the EDCs in generate
mode to produce the correct check bit sequence for the data in the Data
Input Latch.
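
The write (generate), Detect, and Correct modes can be illustrated with a textbook extended Hamming code that also uses 7 check bits for 32 data bits. This is only a sketch: the Am2960 uses its own modified Hamming code, so the bit assignments below are not those of the device, the function names are hypothetical, and the returned `err`/`merr` flags are True when an error is seen (the hardware lines are active low).

```python
CHECK_POS = [1, 2, 4, 8, 16, 32]                             # Hamming check positions
DATA_POS = [p for p in range(1, 39) if p not in CHECK_POS]   # 32 data positions

def encode(data):
    """Generate mode: build a 39-bit codeword; index 0 is the overall parity bit."""
    code = [0] * 39
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    for c in CHECK_POS:
        code[c] = sum(code[p] for p in range(1, 39) if p & c and p != c) % 2
    code[0] = sum(code[1:]) % 2
    return code

def decode(code, correct=True):
    """Detect or Correct mode: return (data, err, merr)."""
    code = list(code)
    syndrome = 0
    for p in range(1, 39):
        if code[p]:
            syndrome ^= p
    parity_ok = sum(code) % 2 == 0
    err = syndrome != 0 or not parity_ok
    # Even parity with a nonzero syndrome means two flips; a syndrome that
    # points outside the word exposes some triple errors.
    merr = (syndrome != 0 and parity_ok) or syndrome > 38
    if correct and err and not merr:
        code[syndrome] ^= 1          # syndrome 0 here means the parity bit itself
    data = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return data, err, merr
```

Calling `decode(codeword, correct=False)` mimics the Detect mode, where the data pass through uncorrected.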
An option on the memory board is the Memory Fence Comparator. These
integrated circuits should only be installed on one memory board per node. If
present, a special instruction called the Write Fence Instruction will load
an immediate 20-bit value into the Fence register. Whenever memory is
written, the address is compared to the 20-bit Fence value. If the address
equals the fence, the status line EQAD is brought high. If the address is less
than the fence, the status line FENCE is brought high. The EQAD status line
finds use as a means of generating a hardware breakpoint. The FENCE status
line finds its use in detecting illegal memory writes.
The Memory Fence feature is used to ensure that an area of dynamic memory
has not been overwritten by mistake. For example, a function look up
table may have been loaded into the lower memory area and the Memory Fence
is set at the top of this table. If a programming error or a hardware
error forced a write to this area of memory, the FENCE line would be brought
high. This FENCE signal would be detected and warn the system operator
that the resulting computations in the node processor may have been corrupted.
The EQAD line is brought high when the Memory Fence value equals the memory
address. This feature is used as a hardware breakpoint during program
debugging.
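
The fence comparison amounts to two tests on every write address. The following is a minimal sketch under that reading; the class and method names are hypothetical and stand in for the Fence register and the Write Fence Instruction.

```python
ADDRESS_MASK = 0xFFFFF   # 20-bit memory address

class MemoryFence:
    """Hypothetical model of the Memory Fence Comparator."""

    def __init__(self):
        self.fence = 0

    def write_fence(self, value):
        """Load an immediate 20-bit value into the Fence register."""
        self.fence = value & ADDRESS_MASK

    def check_write(self, address):
        """Return the EQAD and FENCE status lines for a memory write."""
        address &= ADDRESS_MASK
        return {"EQAD": address == self.fence, "FENCE": address < self.fence}
```

With the fence set at the top of a look up table, any write below it raises FENCE, and a write exactly at the fence raises EQAD.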
NODE PROCESSOR COMMUNICATIONS 
The Node Processor Communications hardware is the most difficult part of
the system architecture to define. (Unless otherwise noted, Node Processor
Communications refers to communications between a node and its six nearest
neighbors.) There are three schemes for this communication: shared memory,
FIFO buffers, and a six way communications controller. Each is discussed in
the following paragraphs.
Shared Memory 
A system with shared memory between nearest neighbors would enable a node
to directly access a section of its neighbor's memory. This is a fast,
perhaps the fastest possible, method of data transfer between processors.
There are several important disadvantages. The memory for such a system
is more complex; it would require dual port memories, which are expensive.
The memory width is 32 bits plus address and control lines. In dealing
with six nearest neighbors, more than 200 connections would be needed on
each processor to implement shared memory. A large number of interconnects
is not desired because of low reliability.
FIFO Buffers
A second method uses narrower data paths (16 bits per transfer) with a FIFO
buffered input and output. Data are transferred between boards in two
separate events or passes. Pass one consists of loading three output FIFOs
and unloading three input FIFOs. Pass two changes the direction of the
data transfer. Each pass sends data to three nearest neighbors and receives
data from the remaining three neighbors.
For example, during pass one, data are passed from processor N to
neighbors 1, 2, and 3. Data are received from neighbors 4, 5, and 6. During
pass two, data are received from neighbors 1, 2, and 3. Data are sent to
neighbors 4, 5, and 6. See the figures labeled Pass 1 and Pass 2.
The maximum number of variables sent along the path from one processor to
its neighbor is 100. Each variable is a 64-bit number. The path width
chosen for data transfers is 16 bits. A FIFO depth of 400 is needed to
hold the variables during any single path. The block diagram for the FIFO
scheme is shown in Figure 10.
The above two methods of processor communications are quite costly in terms
of the amount of hardware involved. Upon reference to Sample Problem 2
(Determination of Processor Computational Capabilities), implementation of
either scheme seems unjustified. If 8 x 10^6 floating point computations
were performed for each time step at a cost of 2 µsec per operation (being
very optimistic about the speed), the compute time would be 16 sec. The
time it would take a node to output 100 variables, 16 bits at a time, at 500
nsec per transfer (being conservatively slow), then input 100 variables
in the same manner, would be 400 µsec. Even if data transfer occurred that
slowly, nearest processor communications would only represent .0025% of a
single time step. This leads to the conclusion that communications may be
performed adequately with less hardware by a simpler set of six input/output
ports.
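
The figures quoted above follow from simple arithmetic, reproduced here as a check. The variable names are illustrative only; the numbers are those given in the text.

```python
FLOPS_PER_STEP = 8e6          # floating point computations per time step
SEC_PER_FLOP = 2e-6           # 2 microseconds per operation
VARIABLES = 100               # variables sent to (or received from) a neighbor
BITS_PER_VARIABLE = 64
PATH_WIDTH_BITS = 16
SEC_PER_TRANSFER = 500e-9     # 500 nanoseconds per 16-bit transfer

compute_time_s = FLOPS_PER_STEP * SEC_PER_FLOP

# 16-bit transfers needed for 100 variables; this is also the FIFO depth.
transfers_one_way = VARIABLES * BITS_PER_VARIABLE // PATH_WIDTH_BITS

# Output 100 variables, then input 100 variables in the same manner.
comm_time_s = 2 * transfers_one_way * SEC_PER_TRANSFER

comm_percent = 100.0 * comm_time_s / compute_time_s
print(transfers_one_way, compute_time_s, comm_time_s, comm_percent)
```

The transfer count of 400 matches the FIFO depth quoted for the FIFO scheme, and the ratio comes out at about 0.0025 percent of a time step.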
[Figures: Pass 1 and Pass 2, showing the direction of data transfer between
a node and its neighbors 1 through 6 on each pass]
Figure 10 NASA/Lewis SDS Node Processor CPU Block Diagram Six-Way Communications Interface 
Six Way Communications Controller
The six way communications controller is a set of six I/O ports under control
of the node processor CPU. Except for the address, all six ports are
identical. Each port is hardwired to the appropriate port on the six nearest
neighbors of the node. See Figure 11.
Each port consists of two 16-bit output latches with a common clock and
separate output enables. There are two 16-bit input registers with separate
clocks and a common output enable.
Output of data to the nearest neighbor is accomplished by fetching a 32-bit
value and latching it in the output latch. The high order half word (16 bits)
is sent when the CPU receives the EMPTY status signal from its nearest
neighbor. The low order half word is sent in the next microcycle. To send a
64-bit value this procedure must be done twice. Figure 12 is a flow chart of
the output loop of the six way communications controller. Figure 13 depicts
the typical output operation of the controller.
Input of data from the nearest neighbor can occur when the EMPTY status
signal is sent to the nearest neighbor. First the high order half word is
clocked into the high order input register. In the next microcycle, the
low order half word is latched into the low order input register and the
EMPTY signal is set (=1). Once EMPTY = 1 the CPU may take the data and put
it in its intended destination. Figure 14 is a flow chart of the input loop
of the six way communications controller. Figure 15 depicts the typical
input operation of the controller.
There are three signals used for handshaking between nodes. When EMPTY is
active low, the input port is ready to be loaded from the nearest neighbor.
The other two handshake lines clock the data into the input register of the
nearest neighbor. The two clocks CKH and CKL clock the upper and lower
halves of a typical transfer into the input register of the nearest neighbor.
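
The handshake can be sketched as a small software model of one receiving port and its sender. This is an illustration under stated assumptions, not the hardware design: the class and function names are hypothetical, EMPTY = 0 is taken to mean "ready to be loaded" and EMPTY = 1 "holds data", and a 64-bit value is assumed to be sent high order word first.

```python
class InputPort:
    """Receiving side of one port: two 16-bit input registers and EMPTY."""

    def __init__(self):
        self.high = 0
        self.low = 0
        self.empty = 0            # 0 = ready to be loaded, 1 = holds data

    def clock_high(self, half_word):
        """CKH: latch the high order half word."""
        self.high = half_word & 0xFFFF

    def clock_low(self, half_word):
        """CKL: latch the low order half word and set EMPTY = 1."""
        self.low = half_word & 0xFFFF
        self.empty = 1

    def read(self):
        """CPU takes the 32-bit value and returns EMPTY to 0."""
        if self.empty != 1:
            raise RuntimeError("no data latched in the input registers")
        self.empty = 0
        return (self.high << 16) | self.low

def send_word(port, word32):
    """Sender side: transfer one 32-bit word if the neighbor is ready."""
    if port.empty != 0:
        return False              # neighbor has not taken the previous word
    port.clock_high(word32 >> 16)
    port.clock_low(word32)
    return True

def transfer_value64(port, value64):
    """Send a 64-bit value as two 32-bit words (high word first, assumed)."""
    received = []
    for word in ((value64 >> 32) & 0xFFFFFFFF, value64 & 0xFFFFFFFF):
        if not send_word(port, word):
            raise RuntimeError("neighbor not ready")
        received.append(port.read())
    return (received[0] << 32) | received[1]
```

A second `send_word` before the receiver has read the first word returns False, which corresponds to the sender waiting in the output loop.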
Figure 11 NASA/Lewis Six-Way Communication Interface Block Diagram 
Output Instruction
CPU Accesses Memory or Register
32-Bit Data Latched in Proper Output Register
Check EMPTY of Neighbor = 1
Output Enable Upper Half of Output Register, Clock Data Into Nearest
Neighbor's Upper Half of Input Register with CKH
Output Enable Lower Half of Output Register, Clock Data Into Nearest
Neighbor's Lower Half of Input Register with CKL
Another Output Instruction? (YES / NO)
Figure 12 Six Way Communications Controller, Output Loop
LOAD OUTPUT REGISTER
TEST STATUS (EMPTY)
WHEN EMPTY == 0 SEND HIGH HALF WORD WITH CLOCK CKH
SEND LOW HALF WORD WITH CLOCK CKL
Figure 13 Output Operation of the Six Way Communications Controller
Input Instruction
EMPTY = 0 of Input Port
Output Enable Input Register of Proper Port, Put Data in Proper Destination
Another Input Instruction? (YES / NO)
Continue
Figure 14 Six Way Communications Controller Flowchart, Input Loop
ASSERT EMPTY = 0 WHEN READY FOR INPUT
THE HIGH ORDER HALF WORD IS SENT AND LATCHED BY THE ADJACENT NODE
THE LOW ORDER HALF WORD IS SENT AND LATCHED BY THE ADJACENT NODE.
EMPTY = 1 (Not empty)
EMPTY = 1 IS DETECTED BY THE CPU AND THE DATA IS READ FROM THE INPUT LATCH.
EMPTY IS SET TO 0.
Figure 15 Input Operation of the Six Way Communications Controller
Global Bus Communications Port 
Each node processor has a single 32-bit bidirectional port known as the
global bus port. This port is used for communications between any node
and the control computer.
At the node processor, the port consists of a 32-bit input latch, a 32-bit
output latch, and a status flip flop.
The type of handshake to be used has not been decided at this time.
FLOATING POINT BUS INTERFACE (FPBI) AND SCRATCH PAD 
The FPBI and Scratch Pad are designed to buffer, format and store data
between the node processor CPU and the Floating Point units (i.e., Floating
Point Multiplier and Floating Point Adder/Subtractor/Divider).
Typical CPU floating point values are 64-bits wide and take up two 32-bit 
wide words of CPU memory. On the other hand, the Floating Point units both 
handle 72-bit wide floating point values. Hence the need for a CPU to 
Floating Point unit interface. The two Floating Point word formats are 
shown below: 
64 BIT CPU Floating Point Value
  Bit 63        Sign Bit of Mantissa
  Bits 62-55    8 Bit Biased Exponent
  Bits 54-0     55 Bit Mantissa (bits 54-32 in one CPU word, bits 31-0
                in the other)

72 BIT Floating Point Bus Value
  Bit 71        Sign Bit of Mantissa
  Bits 70-63    8 Bit Biased Exponent
  Bit 62        Zero Bit
  Bits 61-0     62 Bit Mantissa
CPU to FPBI Transfers
As shown above, the FPBI must accept two 32-bit values from the CPU and
expand this value to 72 bits. The mapping of this expansion is shown
below:

                |------- CPU Word -------|   CPU Word
CPU BIT:        63    62-55    --    54-32     31-0     --
F.P. BUS BIT:   71    70-63    *62   61-39     38-7     **6-0

*F.P. Bus Bit 62=0 if bits 62-55 are all zero; else 62=1.
**F.P. Bus Bits 6-0 are all set to "0".
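
The expansion, and the matching truncation used for transfers from the FPBI back to the CPU, can be written as two bit-field operations. This is a sketch of the mapping only; the function names are hypothetical and the values are handled as plain Python integers.

```python
MANTISSA55_MASK = (1 << 55) - 1

def cpu_to_fp_bus(cpu64):
    """Expand a 64-bit CPU floating point value to the 72-bit bus format."""
    sign = (cpu64 >> 63) & 1
    exponent = (cpu64 >> 55) & 0xFF
    mantissa = cpu64 & MANTISSA55_MASK
    zero_bit = 0 if exponent == 0 else 1       # bus bit 62
    # CPU bits 54-0 land on bus bits 61-7; bus bits 6-0 are set to zero.
    return (sign << 71) | (exponent << 63) | (zero_bit << 62) | (mantissa << 7)

def fp_bus_to_cpu(bus72):
    """Truncate a 72-bit bus value to 64 bits, dropping bit 62 and bits 6-0."""
    sign = (bus72 >> 71) & 1
    exponent = (bus72 >> 63) & 0xFF
    mantissa = (bus72 >> 7) & MANTISSA55_MASK
    return (sign << 63) | (exponent << 55) | mantissa
```

A value written to the bus format and read back is unchanged, since the expansion only adds bits that the truncation removes.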
FPBI to CPU Transfers
Bit 62 and bits 6-0 of the FPBI word are truncated on a transfer from
the FPBI to the CPU:

F.P. BUS BIT:   71    70-63    61-39     38-7
CPU BIT:        63    62-55    54-32     31-0
                |------ CPU Word ------|  CPU Word

SCRATCH PAD Area
The SCRATCH PAD area of the FPBI is a 256 x 72-bit wide area of fast static
RAM. The SCRATCH PAD serves several purposes: 1) It allows quick access of a
commonly used operand (i.e., parallel access to all 72 bits versus two accesses
to the slower CPU Dynamic RAM); 2) Greater precision is maintained in the
72-bit intermediate value; 3) In matrix operations, a whole row of a matrix
may be stored for convenient access.
Operating Details of FPBI
Write to SCRATCH PAD from CPU
Step 1: An 8-bit address is sent to the F.P. address transparent latch
while the least significant 32 bits of the F.P. value are latched
in an input latch.
Step 2: The higher order 32-bits of the F.P. value are sent along with 
the write command which causes the proper 72-bit value to be 
written in the scratch pad at the appropriate address. 
Read from SCRATCH PAD to CPU 
Step 1: An 8-bit address is sent to the F.P. address transparent latch 
and the upper 32-bits are read in on the data bus while the lower 
32-bits of the F.P. word are latched. 
Step 2: The lower 32-bits are read in from the latch to complete the 
transfer. 
Write from SCRATCH PAD to FLOATING POINT BUS 
An 8-bit address is sent to the F.P. address transparent latch, the F.P.
bus drivers are enabled, and a scratch pad Read is enabled. At the end
of this step the appropriate register in a Floating Point Unit latches
the F.P. value from the bus.
Read from FLOATING POINT BUS to SCRATCH PAD
An 8-bit address is sent to the F.P. address transparent latch, the F.P.
bus receivers are enabled, a scratch pad write is enabled and the appropriate
register in a Floating Point Unit is enabled on the bus. At the end of the
cycle the result is latched into the scratch pad. F.P. status is also
latched in the seven bit status latch.
REGISTER to REGISTER transfer 
Step 1: An 8-bit (source) address is sent to the F.P. address transparent
latch, a scratch pad Read is enabled and the output transfer latch
stores the 72-bit floating point value.
Step 2: An 8-bit (destination) address is sent to the F.P. address
transparent latch, the output transparent latch is output enabled,
the F.P. bus receivers are enabled and a scratch pad write is
enabled.
FPBI Hardware Description 
The FPBI has five main parts: the floating point address latch, CPU memory
data bus buffers and latches, an 8-bit comparator, 256 x 72-bit scratch pad
RAM, and floating point bus buffers and latches. See Figure 16.
The floating point address latch holds the 8-bit address from the CPU of
the scratch pad RAM. This latch is transparent, which means it may be opened
during one cycle and stored on subsequent cycles.
The CPU memory data bus buffers and latches are used to multiplex and
truncate the F.P. data on a scratch pad read. Data is latched, buffered and
expanded during a scratch pad write. Two cycles are required for a read or
write to the memory bus since it is only 32 bits wide.
An 8-bit comparator compares the exponent with zero and sets the zero detect
bit on a write from the CPU to the scratch pad.
The 256 x 72-bit scratch pad is a fast read-write memory used to hold the
expanded double word operands used in the floating point units.
The floating point bus output latch and input buffer isolate the F.P. bus
from the F.P. UNITS and the F.P. scratch pad. There is a 72-bit buffer from
the F.P. BUS to the scratch pad. There is a 72-bit transparent latch from
the scratch pad to the F.P. bus.
The floating point control lines are derived from CPU microcode bits.
The floating point status lines are latched in a seven-bit latch.

Figure 16 NASA/Lewis SDS Node Processor CPU Floating Point Bus Interface Block Diagram
FLOATING POINT BUS CONTROL LINES
WRFP        A low on this line is used to write data into a Floating Point
            Unit.
RDFP        A low on this line is used to read data from a Floating Point
            Unit.
ADD/MULT    A high on this line is used to access the Floating Point
            Adder/Subtractor/Divider. A low on this line is used to access
            the Floating Point Multiplier.
FPR2-FPR0   These lines determine the register of the Floating Point Unit
            accessed and, in the case of the Adder/Subtractor/Divider, the
            function to be performed.

WRFP  RDFP  ADD/MULT  FPR2  FPR1  FPR0
H     H     X         X     X     X     NO OP
L     H     H         L     L     L     Write X operand to Adder
L     H     H         L     L     H     Write Y operand to Adder - Function: Add
L     H     H         L     H     L     Write Y operand to Adder - Function: Subtract
L     H     H         L     H     H     Write Y operand to Adder - Function: Float to Fix
L     H     H         H     L     L     Write Y operand to Adder - Function: Fix to Float
L     H     H         H     L     H     Write Y operand to Adder - Function: Check Status
L     H     H         H     H     L     Write Y operand to Adder - Function: Divide
L     H     H         H     H     H     NO OP
H     L     H         L     L     L     Read Adder result and status
L     H     L         X     L     L     Write X operand to Multiplier
L     H     L         X     L     H     Write Y operand to Multiplier
H     L     L         X     L     L     Read Multiplier result and status
X     X     L         X     H     X     NO OP
L     L     X         X     X     X     Illegal Condition
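
The control line table can be read as a small decoder. The sketch below is an illustration of that reading; the function name and string encoding of the lines are hypothetical, the second multiplier write (FPR0 high) is taken to load the Y operand, and combinations the table does not list are reported as "Unspecified" rather than guessed.

```python
# Adder/Subtractor/Divider writes, keyed by FPR2 FPR1 FPR0.
ADDER_WRITES = {
    "LLL": "Write X operand to Adder",
    "LLH": "Write Y operand to Adder - Function: Add",
    "LHL": "Write Y operand to Adder - Function: Subtract",
    "LHH": "Write Y operand to Adder - Function: Float to Fix",
    "HLL": "Write Y operand to Adder - Function: Fix to Float",
    "HLH": "Write Y operand to Adder - Function: Check Status",
    "HHL": "Write Y operand to Adder - Function: Divide",
    "HHH": "NO OP",
}

def decode_fp_control(wrfp, rdfp, add_mult, fpr):
    """Decode the control lines; each level is 'H' or 'L', fpr is FPR2 FPR1 FPR0."""
    if wrfp == "L" and rdfp == "L":
        return "Illegal Condition"
    if wrfp == "H" and rdfp == "H":
        return "NO OP"
    writing = wrfp == "L"
    if add_mult == "H":                      # Adder/Subtractor/Divider
        if writing:
            return ADDER_WRITES[fpr]
        return "Read Adder result and status" if fpr == "LLL" else "Unspecified"
    if fpr[1] == "H":                        # Multiplier, FPR2 is don't care
        return "NO OP"
    if writing:
        operand = "X" if fpr[2] == "L" else "Y"
        return "Write " + operand + " operand to Multiplier"
    return "Read Multiplier result and status" if fpr[2] == "L" else "Unspecified"
```

For example, `decode_fp_control("L", "H", "H", "HHL")` reports the divide function.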
FLOATING POINT BUS STATUS LINES 
OVF   Low on this line indicates the result is beyond the range of numbers
      which can be represented. All bits of the result are set to 1.
UFL   Low on this line indicates the result is smaller than the smallest
      number which can be represented. All bits except for the ZERO BIT
      (BIT 62) are set to 0.
NEG   High on this line indicates the result of the operation was a
      negative number. It is the same as the negative bit of the mantissa.
ZER   Low on this line indicates the result of the operation was zero.
      All bits of the result are zero.
DNM   Low on this line indicates the Multiplier is done and ready for
      new input.
DNA   Low on this line indicates the Adder is done and ready for new
      input.
D~Z Low on this line indicates a divide by zero error in the floating 
point Adder/Subtractor/Divider. 
FLOATING POINT MULTIPLIER (FPM) 
The Floating Point Multiplier is a very high speed microprogrammed logic 
board designed to exclusively perform all floating point multiplication 
within a node processor. 
The FPM is connected to the CPU of a node via the Floating Point Bus Inter-
face (FPBI). The interface buffers, formats and stores up to 256 72-bit 
floating point values for processing by the CPU or the Floating Point units. 
(See Floating Point Bus Interface.) 
The FPM has three registers which are accessed via the FPBI. They are the 
X operand input register, the Y operand input register, and the Result and 
Status output register. The floating point control bus signals required to 
access these registers are: 
RDFP  WRFP  ADD/MULT  FPR1  FPR0
H     H     X         X     X     No Op
X     X     H         X     X     No Op (FPA)
H     L     L         L     L     Load X operand of multiplier
H     L     L         L     H     Load Y operand of multiplier
L     H     L         L     L     Read Result and Status of Multiplier
The FPM is loaded under control of the node processor CPU. The order of
operand loading is important. The X operand is ordinarily loaded first.
Upon the loading of the Y operand, the multiplier begins execution. On a
succeeding multiply, if the X operand does not change, only the new Y operand
need be loaded. The multiplier will proceed using the old X operand and the
new Y operand. The result returned is a 72-bit product with appropriate
status bits.
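
The loading order can be illustrated with a toy model in which loading Y triggers the multiply and X is retained between operations. This is a behavioral sketch only: the class name is hypothetical and ordinary Python numbers stand in for the 72-bit operands and result.

```python
class MultiplierModel:
    """Toy model of the FPM register interface (hypothetical names)."""

    def __init__(self):
        self.x = None
        self.result = None

    def load_x(self, value):
        """Load the X operand; no multiply is started."""
        self.x = value

    def load_y(self, value):
        """Load the Y operand; the multiply begins using the current X."""
        if self.x is None:
            raise RuntimeError("X operand has never been loaded")
        self.result = self.x * value

    def read_result(self):
        """Read the Result register."""
        return self.result
```

Scaling a list of values by a constant therefore needs one X load followed by one Y load per value.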
Assembly Language Instructions 
The floating point multiplier is under control of the node processor CPU. 
The instruction FMUL is the only instruction which uses the FPM.
Multiplication Algorithm 
The floating point numbers have the representation of 1 sign bit, an 8-bit
exponent with a bias of 128 and a 63-bit mantissa for a total of 72 bits.
This format provides a range of 10^-37 to 10^38 with 19 digits of precision.
In all cases the floating point inputs are normalized numbers. Also, if the
exponent is zero (-128 decimal) then the number is zero. This eliminates
gradual underflow or operation with vanishing numbers.

  Bit 71        mantissa sign bit
  Bits 70-63    8-bit biased exponent
  Bits 62-0     63-bit mantissa (bit 62 is also the zero bit)
The floating point multiplication is done in two relatively independent
processes. One process determines the sign and exponent of the result;
the other process determines the mantissa. The two processes interact when
the final mantissa may need to be normalized, thereby changing the exponent
of the result. The multiply algorithm is flowcharted in Figure 17.
The sign of the result is 1 (negative) if, and only if, the signs of the
inputs are not equal. The exponent is found by adding the input exponents
and subtracting the bias (128 decimal). After the result has been normalized,
underflow will occur if the exponent is below the minimum allowed.
Overflow will occur if the exponent is greater than the maximum.

  START
  Multiply 1D, 2D, 3D, 4D;  E := EX - 128;  E := E + EY;  C2 := C out
  Multiply 1C, 2C, 3C, 4C;  S := 2D4D;  N := S + 1D3D
  Multiply 1B, 2B, 3B, 4B;  S := N + 2C4C;  N := S + 1C3C
  Multiply 1A, 2A, 3A, 4A;  S := N + 2B4B;  N := S + 1B3B
  S := N + 2A4A;  N := S + 1A3A
  N := N + Round Bit
  SHIFT RESULT;  E := E - 1
  RETURN
  ZERO RESULT:  FUN1 := 1;  FZE1 := 1;  FIN1 := 1;  RETURN
  RESULT := All 1's:  FOV1 := 1;  FIN1 := 1
Figure 17 Floating Point Multiply Flowchart
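
The sign and exponent process can be sketched in a few lines. The limits used below are assumptions for illustration: the text states only that the exponent is an 8-bit field with a bias of 128 and that a zero exponent denotes zero, so the sketch treats 1 through 255 as the allowed range. The function names are hypothetical.

```python
BIAS = 128
MIN_EXPONENT = 1      # assumed: exponent 0 is reserved for the value zero
MAX_EXPONENT = 255    # assumed: largest value of the 8-bit exponent field

def result_sign_exponent(sign_x, exp_x, sign_y, exp_y):
    """Return (sign, biased exponent) of the product before normalization."""
    sign = sign_x ^ sign_y                 # 1 only if the input signs differ
    exponent = exp_x + exp_y - BIAS        # add exponents, subtract the bias
    return sign, exponent

def classify_exponent(exponent):
    """Report overflow or underflow for a normalized result exponent."""
    if exponent > MAX_EXPONENT:
        return "overflow"
    if exponent < MIN_EXPONENT:
        return "underflow"
    return "ok"
```

Two operands with biased exponents 130 and 129 (that is, 2 and 1 unbiased) give a biased product exponent of 131 before any normalization shift.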
The calculation of the result mantissa is done by finding the sum of 16 
partial products, each 32 bits, to form a 128-bit result. The result is 
then normalized, rounded and truncated to 63 bits. 
The X and Y input mantissas are represented by four 16-bit fields. The
63-bit input mantissas are left justified within these 64 bits.

X:  A  B  C  D
Y:  1  2  3  4

Each 32-bit result of a 16 x 16 multiply is denoted by the two 16-bit fields
multiplied (for example, 2D). The 16 32-bit partial products are added to
form a 128-bit result. The partial products must have their LSB aligned with
the proper bit in the 128-bit result before adding. The alignment of the
partial products may be visualized as:
[Diagram: alignment of the 16 partial products within the 128-bit RESULT,
bits 127 to 0]
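
The sum of aligned partial products can be checked numerically. The sketch below assumes that A and 1 are the most significant fields and D and 4 the least significant, which the text does not state explicitly; the function names are hypothetical, and the code forms the full 128-bit sum rather than following the hardware's accumulation order.

```python
MASK16 = 0xFFFF

def fields(mantissa64):
    """Split a 64-bit value into four 16-bit fields, most significant first."""
    return [(mantissa64 >> shift) & MASK16 for shift in (48, 32, 16, 0)]

def multiply_mantissas(x64, y64):
    """Sum the 16 aligned 32-bit partial products into a 128-bit result."""
    result = 0
    for i, x_field in enumerate(fields(x64)):        # A, B, C, D
        for j, y_field in enumerate(fields(y64)):    # 1, 2, 3, 4
            partial = x_field * y_field              # one 16 x 16 multiply
            shift = 16 * ((3 - i) + (3 - j))         # LSB position in the result
            result += partial << shift
    return result

def left_justify(mantissa63):
    """Place a 63-bit mantissa in the upper 63 bits of a 64-bit field."""
    return (mantissa63 & ((1 << 63) - 1)) << 1
```

The result equals the ordinary integer product of the two 64-bit fields, which is what the 16 partial products reconstruct.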
The hardware does not maintain the full 128-bit result. The lower 64 bits
are used to set the guard and sticky bits, which are used when rounding is
accomplished. (See round and normalize hardware.)
F . P. MULTIPLIER HARDWARE 
The FPM hardware consists of: a microprogram controller and PROMS, two 
operand lnput registers, an exponent ALU, four array multipliers, a mantissa 
ALU, a mantissa shifter, round and normalization logic and an output buffer. 
See Figure 18. 
Microprogram Controller and PROMs 
The FPM has its own microprogram contained in PROM. The microprogram 
counter is a binary up counter with preset and reset capabilities. The 
counter is reset to zero when the Y operand is loaded. The counter normally 
sequences through the microcode. Certain microinstructions allow conditional 
or unconditional presetting of the counter which causes branching to different 
sections of the microprogram. At the end of the multiply microroutine, 
the counter is disabled until the new operand(s) is (are) loaded.
Operand Input Registers 
The input registers are positive edge triggered registers which are addressed
and loaded under the control of the CPU. There are two 72-bit input registers.
One holds the X operand; the other holds the Y operand. Also on the input
of the FPM is some decoding logic which determines from the CPU FP control 
signals whether the X operand is to be loaded, the Y operand is to be loaded 
or the Result is to be read. 
Exponent ALU 
The exponent ALU consists of an eight-bit ALU, an output register, an output
buffer and two selectable constant inputs. The A input of the exponent ALU
may be either the exponent of the X operand or the output of the latch or
the adder output. The B adder input may be the exponent of the Y operand,
the constant -1, or the constant -128.
Figure 18 NASA/Lewis SDS Node Processor Floating Point Multiplier Block Diagram
Array Multipliers 
There are four 16 x 16 array multipliers which are used to generate four
32-bit partial products simultaneously. Four multiplies are done in each
multiplier to compute all 16 partial products. The bits generated by the
multipliers are latched in four separate 32-bit tristate registers.
Mantissa ALU
The mantissa ALU is a 68-bit wide ALU. The ALU output is loaded into the
parallel load input of a 68-bit shifter. The A input of the adder may be
either the output of the 68-bit shifter, or the output of the 68-bit shifter
right shifted 16 bits.
The B input of the ALU selects between two pairs of latched multiplier 
outputs. 
Mantissa Shifter
The mantissa shifter latches the output of the 68-bit adder. The shifter
is used to left shift the result if normalization is necessary.
Round and Normalize Logic
Since there are more bits computed than are retained for a final result, it
is necessary to round the result. All results are rounded to the nearest
expressible value, with rounding to even if the value is exactly between
the two possible representations. In this scheme, a guard bit and a sticky
bit are necessary. The result of a mantissa multiplication is 64 bits.
The LSB of the intermediate result is designated as the rounding bit. The
guard bit is the bit immediately to the right of the LSB which would normally
be lost. In the event the intermediate result must be left shifted to
normalize, the guard bit is shifted into the LSB of the result and becomes
the rounding bit.
The sticky bit and guard bit are used to correctly determine which direction 
to round when done. The sticky bit "remembers" if there were any bits set
in the low order 63 bits of the 128-bit result. These low order bits are
not carried through the computation. Instead, during each of the four
multiply steps, the low order 16 bits are passed to the guard and sticky
bit logic. This logic does the following:
1. Logical OR of guard bit with sticky bit
2. Logical OR 15 least significant bits with sticky bit 
3. MSB of 16 bits becomes new guard bit 
In the round step a 1 is added to the mantissa if and only if 
1 = (Bit 63) (Bit 0) • (Guard + Sticky) + (Bit 63) • (Guard)(Sticky) 
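The three-step guard and sticky update listed above can be sketched in Python as follows. This is an illustrative model only: the function names are hypothetical, and it assumes the four 16-bit groups are presented least significant first, as the multiply steps imply. It covers the accumulation of the two bits, not the final rounding decision.

```python
def update_guard_sticky(guard, sticky, low16):
    """Apply the three-step update for one 16-bit group of discarded bits."""
    sticky |= guard                          # 1. OR old guard bit into sticky
    sticky |= 1 if (low16 & 0x7FFF) else 0   # 2. OR the 15 LSBs into sticky
    guard = (low16 >> 15) & 1                # 3. MSB of the group is the new guard
    return guard, sticky

def guard_sticky_from_low_bits(low_bits, groups=4):
    """Accumulate guard and sticky over the discarded low-order bits,
    taken 16 bits at a time starting from the least significant group."""
    guard = 0
    sticky = 0
    for k in range(groups):
        low16 = (low_bits >> (16 * k)) & 0xFFFF
        guard, sticky = update_guard_sticky(guard, sticky, low16)
    return guard, sticky
```

After four groups the guard bit equals bit 63 of the discarded 64 bits and the sticky bit is the OR of the 63 bits below it, which matches the description of the sticky bit given above.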
FLOATING POINT ADDER/SUBTRACTOR/DIVIDER (FPA) 
The Floating Point Adder is a high speed microprogrammed logic board
designed to perform floating point addition, subtraction and division.
The FPA also will convert a floating point number to fixed point, a fixed
point number to floating point, and set the appropriate status for a single
input.
The FPA is connected to the CPU of a node via the Floating Point Bus
Interface (FPBI). The interface buffers format and store up to 256 72-bit
floating point values for processing by the CPU or the Floating Point units
(see Floating Point Bus Interface).
The FPA has three registers which are accessed via the FPBI. They are the
X-operand input register, the Y-operand input/command register, and the
result/status output register. Floating point control bus signals to access
these registers are:
WRFP  RDFP  ADD/MULT  FPR2  FPR1  FPR0  Function
 H     H       X       X     X     X    No Op
 L     H       L       X     X     X    No Op (FPM)
 L     H       H       L     L     L    Write X-Operand to FPA
 L     H       H       L     L     H    Write Y-Operand to FPA; Function: ADD
 L     H       H       L     H     L    Write Y-Operand to FPA; Function: SUBTRACT
 L     H       H       L     H     H    Write Y-Operand to FPA; Function: FLOAT to FIX
 L     H       H       H     L     L    Write Y-Operand to FPA; Function: FIX to FLOAT
 L     H       H       H     L     H    Write Y-Operand to FPA; Function: CHECK STATUS
 H     L       H       L     L     L    Read FPA Result and Status
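The control-signal table above amounts to a small decode function. The Python sketch below is one way to express it, with signal levels given as the strings "H" and "L". The function and dictionary names are hypothetical, the third register-select signal is assumed to be FPR0 (its label is garbled in the source), and any combination not listed in the table is reported as unspecified rather than guessed.

```python
# Write functions selected by (FPR2, FPR1, FPR0) when WRFP=L, RDFP=H, ADD/MULT=H.
FPA_WRITE_FUNCTIONS = {
    ("L", "L", "L"): "Write X-Operand to FPA",
    ("L", "L", "H"): "Write Y-Operand to FPA; Function: ADD",
    ("L", "H", "L"): "Write Y-Operand to FPA; Function: SUBTRACT",
    ("L", "H", "H"): "Write Y-Operand to FPA; Function: FLOAT to FIX",
    ("H", "L", "L"): "Write Y-Operand to FPA; Function: FIX to FLOAT",
    ("H", "L", "H"): "Write Y-Operand to FPA; Function: CHECK STATUS",
}

def decode_fpa_control(wrfp, rdfp, add_mult, fpr2, fpr1, fpr0):
    """Return the FPA operation selected by the FP control bus signals."""
    if (wrfp, rdfp) == ("H", "H"):
        return "No Op"                       # other signals are don't-care
    if (wrfp, rdfp) == ("L", "H"):
        if add_mult == "L":
            return "No Op (FPM)"             # write addressed to the multiplier
        return FPA_WRITE_FUNCTIONS.get((fpr2, fpr1, fpr0), "Unspecified")
    if (wrfp, rdfp, add_mult, fpr2, fpr1, fpr0) == ("H", "L", "H", "L", "L", "L"):
        return "Read FPA Result and Status"
    return "Unspecified"
```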
FP Adder/Subtractor and Divider Hardware 
The FPASD hardware consists of: a microprogram controller and PROMs, an
exponent ALU, a mantissa ALU, a mantissa shifter, and output buffers and 
latches. See Figure 19. 
Microprogram Controller and PROMs
The FPASD has its own microprogram contained in PROM. The microprogram
counter is a binary up counter with preset and reset capabilities. The
counter is preset to a special address in the lower address space of the
microcode PROM upon the loading of the Y-operand. The preset address is
determined by the FP register control bits. The microprogram then executes
the instructions for the proper algorithm. Certain microinstructions allow 
conditional or unconditional branching to different sections of the micro-
program. Otherwise, microprogram execution is sequential. 
At the end of a microprogram, the counter is disabled until the new operand(s) 
is (are) loaded. 
Operand Input Registers 
The input registers are positive edge triggered registers which are addressed 
and loaded under control of the CPU. There are two 72-bit input registers.
One contains the X-operand while the other contains the Y-operand. Also on 
the input of the FPASD is some decoding logic which determines from the CPU 
FP control signals whether the X-operand is to be loaded, the Y-operand is 
to be loaded, or the Result is to be read. 
Exponent ALU 
The exponent ALU consists of an 8-bit ALU, four registers, and an 8-bit counter.
The first register is on the output of the ALU. The output of the first
register feeds the 8-bit counter, a second register which has its output
driving the A input of the ALU, and a third register used to hold the exponent
result and drive the FP bus. The fourth register is on the outputs of the
8-bit counter and drives the B inputs of the ALU.
Figure 19 NASA/Lewis SDS Node Processor Block Diagram Floating Point Adder/Subtractor/Divider
In other words, input A to the ALU may come from either the X input register 
or register number 2. The B input of the ALU may come from either the Y input 
register or register number 4. 
Mantissa ALU 
The mantissa ALU consists of two 64-bit latches, a 64-bit ALU, a 68-bit shifter,
a 64-bit transparent latch, a 64-bit zero detect circuit, and a 72-bit shifter.
Either of the two 64-bit intermediate mantissa registers may be loaded from 
the X operand input register, the Y operand input register, the transparent 
latch, the output of the A register itself, or the 72-bit shifter. 
The ALU A input may come from either the X or Y operand input registers, 
the transparent latch, or the A intermediate mantissa register. The ALU B 
input may only come from the B intermediate mantissa register. 
The input of the 68-bit shifter may only come from the 64-bit ALU. The input 
of the 64-bit latch may only come from the 68-bit shifter.
The 72-bit shifter, the zero detect circuit, and the 64-bit output buffer 
all have the same inputs as the ALU A input. 
Assembly Language Instructions 
FLOATING POINT ADD 
FLOATING POINT SUBTRACT 
FLOATING POINT DIVIDE 
FIX TO FLOAT 
FLOAT TO FIX 
STATUS CHECK 
Floating Point Addition Algorithm
Floating point addition is performed in the Floating Point Adder/Subtractor/
Divider according to a two's complement addition algorithm described below.
X and Y are the floating point input operands and Z is the sum. The
flowchart for this algorithm is shown in Figure 20.
1. Compare X-exponent and Y-exponent. If Y-exponent is greater
than X-exponent swap X and Y so that the larger value is in
X. If the exponents are equal, leave X and Y alone.
2. Subtract the exponents so that D = X-exponent - Y-exponent.
If D is greater than or equal to 63 the answer is X and the
procedure may be stopped, otherwise continue.
3. Convert the signed magnitude mantissas to two's complement
form.
4. Shift Y-mantissa to the right D times.
5. Perform a two's complement addition of the two mantissas.
Shift the result one place to the right and increment the
exponent if a carry is generated.
6. Convert the two's complement result to sign magnitude form.
7. Normalize the result by shifting the mantissa left until a
1 appears in the most significant bit. Decrement the
exponent for each shift.
8. Round and latch result.
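The eight steps above can be followed in a short Python sketch. Several things here are assumptions made for illustration and are not taken from the report: an operand is modelled as a (sign, exponent, mantissa) tuple with a 63-bit mantissa whose most significant bit is set when normalized, Python's signed integers stand in for the two's complement hardware, the carry test is applied to the magnitude after step 6, and step 8 is reduced to truncation with no rounding or status flags.

```python
MANT_BITS = 63  # mantissa width assumed from the 63-bit mantissas in the text

def fp_add(x, y):
    """Add two operands given as (sign, exponent, mantissa) tuples."""
    (sx, ex, mx), (sy, ey, my) = x, y
    # Step 1: make X the operand with the larger exponent.
    if ey > ex:
        (sx, ex, mx), (sy, ey, my) = (sy, ey, my), (sx, ex, mx)
    # Step 2: exponent difference; Y is negligible if D >= 63.
    d = ex - ey
    if d >= MANT_BITS:
        return (sx, ex, mx)
    # Step 3: sign magnitude to signed (two's complement) values.
    a = -mx if sx else mx
    b = -my if sy else my
    # Step 4: align Y by shifting right D places.
    b >>= d
    # Step 5: signed addition of the mantissas.
    s = a + b
    # Step 6: back to sign magnitude.
    sign = 1 if s < 0 else 0
    mag = -s if s < 0 else s
    e = ex
    if mag >> MANT_BITS:          # carry out of the 63-bit mantissa
        mag >>= 1
        e += 1
    if mag == 0:
        return (0, 0, 0)          # zero result
    # Step 7: normalize so the mantissa MSB is 1.
    while ((mag >> (MANT_BITS - 1)) & 1) == 0:
        mag <<= 1
        e -= 1
    # Step 8: rounding is omitted in this sketch.
    return (sign, e, mag)
```

For example, with 1.0 written as `(0, 129, 1 << 62)` (a mantissa of one half and a bias of 128), adding it to itself returns `(0, 130, 1 << 62)`, that is, 2.0.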
Floating Point Subtraction Algorithm
Floating point subtraction differs only slightly from floating point
addition. If Z = X-Y, change the sign of Y and proceed with steps 1 through
8 of the floating point addition algorithm. The flowchart for the algorithm
is shown in Figure 21.
Figure 20 Floating Point Addition Flowchart
Figure 21 Floating Point Subtraction Flowchart
Floating Point Division Algorithm
Floating point division is performed in the Floating Point Adder/Subtractor/
Divider according to a non-restoring binary division algorithm described
below. The dividend is X, the divisor is Y and the quotient is Z. The
flowchart for the algorithm is shown in Figure 22.
1. Subtract the exponents so that Z exponent = X exponent - Y exponent
+ bias.
2. Compare the mantissas. If X mantissa is greater than the Y mantissa
shift X mantissa to the right one bit and increment Z-exponent.
3. Set counter I to 62. Clear Z-mantissa (Z mantissa = 0).
4. If X-mantissa equals zero, then stop.
5. Perform subtraction X = 2X-Y.
6. Test result of subtraction. If negative go to step 10, else
continue.
7. Set bit I in quotient to 1 (Z(I) = 1).
8. Decrement I.
9. If I does not equal zero, go to step 4, else stop.
10. Add X = 2X+Y.
11. Decrement I.
12. If I does not equal zero, go to step 6, else stop.
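The sketch below shows a textbook non-restoring division in Python, in the spirit of the steps above but not a transcription of the FPA microcode. The assumptions are: X is treated as the dividend (as the exponent formula in step 1 implies), operands are (sign, exponent, mantissa) tuples with normalized 63-bit mantissas, the bias is 128, the quotient is truncated rather than rounded, and instead of shifting the X mantissa right in step 2 the divisor is doubled, which has the same effect without losing a bit. All names are hypothetical.

```python
MANT_BITS = 63
BIAS = 128

def nonrestoring_divide(x, y, nbits):
    """Return floor(x * 2**nbits / y) for integers 0 <= x < y."""
    r = x          # partial remainder, may go negative
    q = 0
    for _ in range(nbits):
        if r >= 0:
            r = 2 * r - y      # remainder non-negative: subtract divisor
        else:
            r = 2 * r + y      # remainder negative: add divisor back
        q = (q << 1) | (1 if r >= 0 else 0)
    return q

def fp_divide(x, y):
    """Divide X by Y, both given as (sign, exponent, mantissa) tuples."""
    sx, ex, mx = x
    sy, ey, my = y
    if my == 0:
        raise ZeroDivisionError("divisor mantissa is zero")
    if mx == 0:
        return (0, 0, 0)
    ez = ex - ey + BIAS                 # step 1
    divisor = my
    if mx >= my:                        # step 2: keep the quotient below 1
        divisor = 2 * my
        ez += 1
    mz = nonrestoring_divide(mx, divisor, MANT_BITS)
    return (sx ^ sy, ez, mz)            # quotient sign is the XOR of the signs
```

Because both mantissas are normalized, the quotient mantissa produced this way already has its most significant bit set, so no further normalization is needed in the sketch.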
Figure 22 Floating Point Division Flowchart
HARDWARE ASSESSMENT 
The system architecture of an array of node processors controlled by a
central minicomputer was chosen. The node processor design was based upon
the computational requirements of a structural dynamics simulator. A
microprogrammable bit-slice architecture allowed an extremely powerful
custom instruction set. Additional hardware was designed to greatly speed
up floating point calculations.
Every node processor may access a large dynamic memory. Nearest neighbor
communications have been implemented for efficient interprocessor
communications.
The dynamic memory, CPU to floating point unit interface, and two floating
point units were designed up through schematic diagrams. The CPU and the
communications interfaces were designed through block diagrams.
The hardware architecture chosen is as powerful as is possible while
remaining economically feasible with today's technology. As upgraded versions
of the circuitry become available, the speed to cost ratio will increase.
Waiting for future advancement poses few advantages. New products are rarely
ready on schedule. The hardware suggested here is available now to provide an
extremely powerful and useful tool for structural dynamics simulation.
DISCUSSION OF RESULTS 
The approach taken in this design of a Digital System for Structural Dynamics 
Simulation is innovative. From a hardware standpoint, the system takes 
advantage of decreasing costs and increased computational power of state-
of-the-art digital technology. The software associated with this system 
is of necessity state-of-the-art. The main concepts of value in the simu-
lation application are the segmentation of the problem (into 125 equal parts), 
a custom instruction set tailored for simulation, and the high speed of the 
computing hardware. 
At this point it is suggested that the detailed design of a node processor 
be completed and a prototype constructed. The instruction set should be 
microcoded and small programs should be written to exercise the hardware/ 
firmware. At the conclusion of this phase, the decision can be made as to 
purchase of a control minicomputer and subsequent production of the entire 
array of processors. 
The technical risk of producing a single functioning node processor is not
great and results primarily from the tedium of microcoding such a machine.
Constructing the entire system represents a challenging problem in packaging,
cooling, interconnecting, and testing.
SUMMARY OF RESULTS AND RECOMMENDATIONS 
The subject of this report has been the design of a Digital System for
Structural Dynamics Simulation.
The results of this program can be summarized as follows:
1. A search of the field of parallel processing/simulation was done
to discover work which would be a duplication of effort of this
program. No such duplication was found.
2. The principles of simulation modeling methods were explored and
a method (Runge-Kutta) chosen for this class of problems.
3. The architecture of an array of processors was conceived as the
best possible solution to the simulation problem.
4. The node processor architecture of bit-slice microprogrammed CPU,
large dynamic memory and custom floating point hardware was chosen.
5. The floating point hardware and dynamic memory were designed to
the detailed schematic stage. The CPU was designed to the block
diagram level.
6. The instruction set was designed and flowcharted. This information
is contained in the Node Processor Instruction Set Reference
and the Microcode Flowcharts for the Node Processor Instruction
Set provided to NASA as a separate report.
7. System software requirements were outlined.
For meaningful continuation of this program, additional effort in the areas
below is recommended.
1. A sample structural dynamics problem should be developed, its
solution coded and run on a main frame computer.
2. The hardware and firmware (microcode) for a single node processor
should be prototyped.
3. Segments of the sample problem should be run on the node processor
with results being checked back to the main frame solution.
4. The control minicomputer should be specified and software design
started for the system.
5. The debugged prototyped node processor can then be produced in
quantity. Packaging and interfacing to the central minicomputer
followed by system checkout will complete the program.
REFERENCES 
1. Burden, R., Faires, J. and Reynolds, A., "Numerical Analysis",
Prindle, Weber & Schmidt, Boston, Mass., 5th Printing, August 1980.
2. Charlesworth, A. E., "An Approach to Scientific Array Processing: The
Architectural Design of the AP-120B/FPS-164 Family", Computer, Vol.
14, No. 9, Sept. 1981, pp. 18-27.
3. Advanced Micro Devices, "The Am2900 Family Data Book", Advanced
Micro Devices, California (1979).
APPENDIX I
Summary of Current Similar Simulation Programs

1. Univ. of Wisconsin
   Purpose of work: Solution of partial differential equations.
   Approach: Net of computers. Nearest neighbor (2 dimensional array).
   Global bus structure.
   Present status: Study stage.
2. NASA Ames
   Purpose of work: Calculation of three-dimensional flows for aircraft.
   Approach: Use of parallel processing concepts to get 10 [exponent
   illegible] FLOPS. Architecture not based on physical problem.
   Present status: System design.
3. NASA Langley
   Purpose of work: Solution of static finite element equation.
   Approach: Array of asynchronous microprocessors. Each connected to 12
   nearest neighbors for communication of displacements. Iterative solution.
   Present status: Four node system operating. Thirty-six node prototype
   under development.
4. Goodyear Aerospace
   Purpose of work: Processing of satellite imagery.
   Approach: Use of 128 x 128 array of processors to process simultaneously
   bit serial data. Each connected to four nearest neighbors. Custom VLSI
   CMOS/SOS chips are used for non-memory portions of each processor.
   Present status: To be completed in 1982.
5. Rensselaer Polytechnic Institute
   Purpose of work: Assessment of chip technology as related to structural
   engineering.
   Approach: Evaluation of current and future chip usage in numerical
   algorithm evaluation. Prediction of impact of chip technology on
   numerical analyses.
   Present status: Started in 1980. To be completed in 1982.
APPENDIX II
Summary of Relevant Papers Describing Parallel
Processing Simulation Procedures

SUMMARY OF RECENT PAPERS DESCRIBING PARALLEL PROCESSING

1. Baskett, F. and A. J. Smith, "Interference in Multiprocessor Computer
   Systems with Interleaved Memory," ACM, Volume 19, No. 6, June 1976,
   pp. 327-334.
   Company/Organization: Stanford University; University of California,
   Berkeley
   Key words: MEMORY INTERFERENCE, MULTIPROCESSING, INTERLEAVED MEMORY,
   TRACE DRIVEN SIMULATION.
   Subject: Analyzes the memory interference caused by several processors
   simultaneously using several memory modules. Results are computed for a
   simple model.
2. Baudet, G. M., "Asynchronous Iterative Methods for Multiprocessors,"
   Journal ACM, Volume 25, No. 2, April 1978, pp. 226-244.
   Company/Organization: Carnegie-Mellon University, Pittsburgh
   Key words: ASYNCHRONOUS ALGORITHMS, ASYNCHRONOUS MULTIPROCESSORS,
   PARALLEL ALGORITHMS, ITERATIVE METHODS, CHAOTIC RELAXATION, ANALYSIS OF
   ALGORITHMS.
   Subject: Asynchronous iterative methods presented for solving a system
   of equations. Conditions given to guarantee convergence. Advantages of
   purely asynchronous methods.
3. Bhandarkar, D. P., "Some Performance Issues in Multiprocessor System
   Design," IEEE Trans. Computers, Volume C-26, No. 5, May 1977,
   pp. 506-11.
   Company/Organization: Australian National University, Canberra,
   Australia
   Key words: MEMORY INTERFERENCE, MEMORY INTERLEAVING, MULTIPROCESSORS.
   Subject: Guidelines for multiprocessor system architect. Preferred
   design alternatives and/or tradeoffs.
4. Enslow, P. H., "Multiprocessor Organization - A Survey," Computing
   Surveys, Volume 9, No. 1, Mar. 1977, pp. 103-129.
   Company/Organization: Georgia Institute of Technology, Atlanta
   Key words: COMPUTER SYSTEM ORGANIZATION, CONCURRENT OPERATIONS,
   INTERCONNECTION SUBSYSTEMS, MULTIPROCESSOR OPERATING SYSTEMS.
   Subject: Time-shared buses, crossbar switch matrix, multibus/multiport
   memories, interconnection systems discussed. Three operating systems
   (master-slave, separate executive for each processor, symmetric
   treatment of all processors) reviewed.
5. Kafura, D. G. and V. Y. Shen, "Tasks Scheduling on a Multiprocessor
   System with Independent Memories," SIAM J. Computing, Vol. 6, No. 1,
   Mar. 1977, pp. 167-187.
   Company/Organization: Iowa State University; Purdue University
   Key words: SCHEDULING ALGORITHMS, DETERMINISTIC MODELS, WORST-CASE
   BOUNDS, MEMORY CONSTRAINTS.
   Subject: Scheduling strategies evaluated for system of identical
   processors with a private memory.
6. Kinney, L. L., and R. G. Arnold, "Analysis of a Multiprocessor System
   with a Shared Bus," Conference Proceedings 5th Ann. Symp. Computer
   Architecture, Palo Alto, CA, April 1978, pp. 89-95.
   Company/Organization: University of Minnesota; Honeywell Corporation,
   Minneapolis
   Key words: FIFO, QUEUE, MULTIBUS SYSTEM, FIFO SHARED-BUS.
   Subject: Analysis of a multiprocessor system with shared-bus.
   Determining the processing power as the number of processors is
   increased.
7. Kuznia, C. H., R. Kober, and H. Kopp, "SMS - A Structured
   Multimicroprocessor System with Deadlock-Free Operation Scheme," Conf.
   Proc. 3rd Ann. Symp. Computer Architecture, Clearwater, Florida,
   Jan. 1976, p. 122.
   Company/Organization: SIEMENS AG, Hofmannstr, Germany
   Key words: DISTRIBUTED MEMORY, MULTIPROCESSOR, COMMUNICATIONS MEMORY.
   Subject: Multiprocessor system design for large systems of differential
   equations.
8. O'Grady, E. P., "A Multiprocessor for Continuous System Simulation",
   Proceedings 1979 Int. Conf. Parallel Processing, Bellaire, MI,
   Aug. 1979, p. 306.
   Company/Organization: Arizona State University, Tempe
   Key words: SIMULATION, INTERPROCESSOR COMMUNICATION, BIT-SLICE
   MICROPROCESSOR, BUS CONTROL PROCESSOR.
   Subject: Simulation-oriented multiprocessor system employing a new
   concept in interprocessor communication is described. Parallelism in
   conjunction with address-mapping memories realize an efficient
   high-speed transfer mechanism.
9. Pearce, R. C. and J. C. Majithia, "Upper Bounds on the Performance of
   Some Processor-Memory Interconnections," Proc. 1976 Int. Conf. Parallel
   Processing, Walden Woods, MI, Aug. 1976, p. 303.
   Company/Organization: University of Waterloo, Ontario
   Key words: CROSS-POINT, TIME-SHARED BUS, PIPELINED LOOP, BINARY SWITCH.
   Subject: Multiprocessor performance evaluated for cross-point,
   time-shared, pipelined loop, and binary switch methods.
10. Pearce, R. C. and J. C. Majithia, "Performance Results for an M.I.M.D.
    Computer Organization Using Pipelined Binary Switches and Cache
    Memories," Proc. Inst. Electronic Engineers (England), Vol. 125,
    No. 11, Nov. 1978, pp. 1203-1207.
    Company/Organization: University of Waterloo, Ontario
    Key words: PIPELINED PROCESSING, CACHE MEMORIES, M.I.M.D. ARCHITECTURE.
    Subject: Throughput performance with respect to variations in cache
    memory parameters, number of processors, processing time of a system
    in which a binary switch is used as the interconnection network.
11. Sastry, K. V. and R. Y. Kain, "On the Performance of Certain
    Multiprocessor Computer Organization," IEEE Trans. Computers,
    Vol. C-24, No. 11, Nov. 1975, pp. 1066-1074.
    Company/Organization: Sperry Univac, Roseville, Minn.; University of
    Minnesota
    Key words: ANALYTIC MODELS, INSTRUCTION EXECUTION RATES, MEMORY
    CONFLICTS, MULTIPROCESSORS.
    Subject: Performance of a multiprocessor system with different storage
    allocations for instructions and data with interleaving in the
    instruction space is presented.
12. Yang, Chao-Chih, "Fast Algorithms for Bounding the Performance of
    Multiprocessor Systems," Proc. 1976 Intl. Conf. Parallel Processing,
    Walden Woods, Mich., Aug. 1976, pp. 73-82.
    Company/Organization: University of Alabama, Birmingham
    Key words: PRECEDENCE PARTITION, PARTIALLY ORDERED TASKS.
    Subject: Proposing two types of scheduling for more efficient
    execution of a multiprocessor system.
13. Patel, Janak H., "Processor-Memory Interconnections for
    Multiprocessors," Conf. Proc. 6th Ann. Symp. Computer Architecture,
    Philadelphia, PA, April 1979, pp. 168-177.
    Company/Organization: School of Electrical Engineering, West
    Lafayette, IN
    Key words: INTERCONNECTION NETWORKS, CROSSBAR SYSTEMS, DELTA NETWORKS.
    Subject: Interconnection networks proposed for processor to memory
    communication and multiprocessing system allows a direct link between
    any processor to any memory module.
