An array computer for the class of problems typified by the general circulation model of the atmosphere by Graham, Marvin Lowell & Slotnick, D.L.




UIUCDCS-R-75-761
AN ARRAY COMPUTER FOR THE CLASS OF PROBLEMS TYPIFIED
BY THE GENERAL CIRCULATION MODEL OF THE ATMOSPHERE
BY
Marvin L. Graham and D. L. Slotnick
December 1975

Report No. UIUCDCS-R-75-761
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801
*This work was supported in part by NASA Goddard Space Flight Center under
Grant No. US NASA NAS-5-2333^ and was submitted in partial fulfillment of
the requirements for the degree of Doctor of Philosophy in Computer Science,
December 1975, for Marvin L. Graham.

TABLE OF CONTENTS
1. Introduction
2. The Problem
2.1 General Circulation Models
2.1.1 Vertical Levels
2.1.2 Time
2.1.3 Horizontal Resolution and Various Differencing Schemes
2.2 GISS Modifications to the Model
2.3 The Effects of the Oceans on the Atmosphere
2.4 Input and Output Requirement of the Model
3. The Array Computer
4. The System Design
4.1 System Parameters
4.1.1 Word Size
4.1.2 Word Format
4.1.3 Memory Requirements
4.1.4 Measurements of the GISS Model
4.1.5 Processor Speed Requirements
4.1.6 The Choice of TTL Technology for the Processor
4.2 The Processor Design
4.2.1 Convention Used in the Figures Which Describe Logic
4.2.2 Signal Name Notation Used in the Design Description
4.2.3 Inversion in the Logic Figures
4.2.4 Detailed Description of Two Packages
4.2.5 The Processor Design
4.3 Processor Intercommunication - The Routing Network
4.3.1 Routing Network Control
4.3.2 ECL Logic
4.3.3 Routing Network Time and Component Count Estimates
4.3.4 Table Look Up
4.3.5 Communication with the Control Unit and the Input-Output Channel
4.4 The Control Unit
4.4.1 Control of the Processor Array
4.4.2 Control of the Routing Network
5. Design Testing
5.1 The Logic Simulation System
5.1.1 The Logic Simulator Language and the Preprocessor
5.1.2 Timing by the Simulator
5.1.3 Debugging Aids in the Simulation System
5.1.4 Simulated Packages with No Exact Hardware Analog
5.1.5 Loops
5.1.6 Wiring Lists
5.2 The Multiplier Prototype
6. System Performance
6.1 Processor and Routing Unit Cycle Times
6.2 Performance of the System on the General Circulation Model
6.2.1 The Rectangular Model
6.2.2 The Split Grid Model
6.2.3 The Polar Circle Sum
6.2.4 A Hardware and Time Comparison of the Clos, Omega and Nearest Neighbor Routing Schemes
6.3 Image Data Processing
6.3.1 Image Data Clustering
6.3.2 Image Data Classification
6.3.3 Byte Packing and Unpacking
6.4 File Processing and Information Retrieval
6.4.1 File Statistics
6.4.2 Information Retrieval
6.5 Matrix Inversion by Gaussian Elimination
6.5.1 Solution of Inhomogeneous Systems
6.5.2 Inversion of a Matrix
7. Operating Parameters of the System
8. Conclusion
References
Appendix
VITA


1. Introduction
The goal of the research described by this paper was the design of a
computer suited to the class of problems typified by the general circulation
model of the atmosphere. The research was supported in large part by the
Goddard Institute for Space Studies (GISS) of the National Aeronautics and
Space Administration (NASA). The needs that prompted GISS to support the
research imposed several practical constraints on the design which was
sought. A fundamental goal was that the machine which resulted from the
design was to have roughly 100 times the computing capability of the GISS
IBM 360/95 which is now used for research with a general circulation model.
Their desire to increase the spatial resolution of that model by refining
the grid implied the need for a 100 fold increase in computing capability to
stay even in terms of the real time.
A second requirement was that the resulting machine be programmable
in a higher level language similar to FORTRAN. The current model is written
almost entirely in FORTRAN, and the GISS staff planned to modify an existing
compiler for CFD - a FORTRAN-like language - for ILLIAC IV for use with their
new machine. Moreover, the new machine was to cooperate in the general
circulation experiments on the expanded models with the IBM 360/95; the IBM
machine would continue to be used for the pre-processing and post-processing
of model data which it now performs for the smaller model which it also now
executes. The implication of the FORTRAN and IBM machine constraints is that
the machine possess floating point arithmetic capability, and that the
floating point format of the machine be close to that of the IBM 360 series.
A third constraint on the design was that the cost of the machine
resulting from the design effort was to be significantly less than that of
other extant machines of similar computing capability. Among these are the
ILLIAC IV, the Texas Instruments Corporation Advanced Scientific Computer,
and the Control Data Corporation STAR.
A final constraint on the design was that it be feasible to
fabricate a complete system and put it in operation by early 1978. A clear
implication of this and the preceding constraint is that there is neither
time nor money for the development of new hardware families, let alone new
chips. The design will have to be made in terms of an existing hardware
family with components readily available off-the-shelf.
2. The Problem
Several groups in the United States are working on global general
circulation models. The three largest efforts are those of Mintz and
Arakawa at UCLA (Arakawa, 1972; Mintz, 1974), Smagorinsky and Manabe at
the Geophysical Fluid Dynamics Laboratory (GFDL) (Smagorinsky, 1963) and
Kasahara and Washington at the National Center for Atmospheric Research
(Kasahara, 1967). The UCLA model is of primary interest to this research
because the model run by GISS (Tsang, 1973) is a modified form of that
model.
2.1 General Circulation Models
A general circulation model simulates the behavior of a three
dimensional spherical atmosphere on a digital computer. The bulk of the
computing load necessary in the simulation is the time integration of the
equations of fluid dynamics of the atmosphere. In the UCLA model,
subroutines called COMP1 and COMP2 perform this time integration of the
equations of motion. Every six cycles through COMP1-COMP2, the effects of
solar radiation in heating the atmosphere and the effects of evaporation,
condensation and precipitation are introduced through the execution of the
COMP3 and COMP4 subroutines. The process is shown in Figure 2.1.2-1. Every
four cycles through the process illustrated by Figure 2.1.2-1, a table
look-up process is used to introduce the effects of long-wave infra-red
energy absorption in the GISS model.
Table 2.1-1 lists the parameters which define the conditions under
which the model operates. Table 2.1-2 lists the variables of the model and
gives their spatial dimensions. Figure 2.1-1, which is taken from a GISS
report on the model (Somerville, 1974), shows the basic equations of the
model. The remainder of this section will describe the UCLA and GISS models.
The emphasis will be on describing the differences between the first UCLA
model (Arakawa, 1972), the GISS model which evolved from it (Somerville, 1974;
Tsang, 1973) and the second UCLA model (Mintz, 1974) to illustrate the range
over which variations of the current GISS model may run in future models.

Prescribed parameters.
To use the atmospheric general circulation model, for this or any
other planet, the following parameters must be prescribed:
    Radius, surface gravity and rotation speed of the planet.
    Solar constant, and orbital parameters of the planet.
    Total atmospheric mass.
    Thermodynamical and radiation constants.
    Geographical distributions of open ocean, ice covered ocean,
    bare land and land covered by glacial ice.
    Elevation of the bare land and glacial ice.
    Surface roughness.
    Thickness of the sea ice.
    Ocean surface temperature.
Table 2.1-1. The Parameters of the General Circulation Model

Variables of the Atmospheric Model
    Horizontal Velocity
        West to East component                  U(X,Y,Z)
        South to North component                V(X,Y,Z)
    Temperature                                 T(X,Y,Z)
    Water Vapor (specific humidity)             q(X,Y,Z)
    Surface Atmospheric Pressure                P(X,Y)
Parameters of the Planetary Boundary Layer (PBL)
    Boundary Layer Depth                        (X,Y)
    Temperature Discontinuity at the PBL        (X,Y)
    Moisture Discontinuity at the PBL           (X,Y)
Parameters of the Earth's Surface
    Ground Temperature                          (X,Y)
    Ground Water Storage                        (X,Y)
    Mass of Snow on the Ground                  (X,Y)
A Future Variable of the Atmospheric Model
    Ozone Concentration                         (X,Y,Z)
Table 2.1-2 The Variables of the General Circulation
Model and their Dimensionalities

$$\frac{d\mathbf{V}}{dt} + f\,\mathbf{k}\times\mathbf{V} + \nabla\Phi + \sigma\alpha\nabla\pi = \mathbf{F}$$
$$p\alpha = RT$$
$$\frac{1}{\pi}\frac{\partial\Phi}{\partial\sigma} = -\alpha$$
[The remaining equations of the figure are not legible in the source.]
Here the notation is
    $\mathbf{V}$    horizontal velocity
    $t$             time
    $f$             Coriolis parameter
    $\mathbf{k}$    vertical unit vector
    $\nabla$        two-dimensional gradient operator
    $\sigma$        the vertical coordinate, $\sigma = (p - p_t)/(p_s - p_t)$
    $p$             pressure
    $p_t$           pressure at top of model atmosphere, constant
    $p_s$           pressure at bottom of model atmosphere
    $\alpha$        specific volume
    $\pi$           $p_s - p_t$
    $\mathbf{F}$    horizontal frictional force
    $R$             gas constant
    $T$             temperature
    $\theta$        potential temperature
    $c_p$           specific heat at constant pressure
    $Q$             heating rate per unit mass
    $\Phi$          geopotential
    $q$             water vapor mixing ratio
    $C$             rate of condensation
    $E$             rate of evaporation.
Figure 2.1-1 The Primitive Equations and the Variables
of the GISS General Circulation Model.
2.1.1 Vertical Levels
The first UCLA model had only three vertical levels. The current
GISS model has nine, and the second UCLA model has twelve. GISS hopes to
expand to a fifteen level model. The new UCLA model incorporates a special
"sponge layer" as its highest level to damp out spurious numerical wave
reflections (Mintz, 1974).
2.1.2 Time
The first UCLA model and the GISS models use the explicit Matsuno
predictor-corrector method for advancing time. For a variable Q, the
scheme uses a forward and a backward step to advance time by one interval
in the following way:
Forward
$$\frac{Q^{*}(t_{n+1}) - Q(t_n)}{t_{n+1} - t_n} = f(Q(t_n))$$
Backward
$$\frac{Q(t_{n+1}) - Q(t_n)}{t_{n+1} - t_n} = f(Q^{*}(t_{n+1}))$$
The forward step uses the current values of the variable and the function f,
which approximates the derivative, to produce an estimate, $Q^{*}(t_{n+1})$,
for the value of the variable at the next time. The backward step uses the
estimated value to compute $Q(t_{n+1})$, the value of the variable at the
next time. The process is illustrated by Figure 2.1.2-1.
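The forward and backward steps can be sketched in a few lines of Python. This is only an illustration of the scheme as described above, not the model code: the names matsuno_step and decay, the test tendency dQ/dt = -Q, and the step size are all assumptions chosen for the example.

```python
def matsuno_step(f, q, dt):
    """Advance q by one interval dt with the forward-backward (Matsuno) scheme."""
    q_star = q + dt * f(q)       # forward step: estimate Q*(t_{n+1})
    return q + dt * f(q_star)    # backward step: corrected Q(t_{n+1})


def decay(q):
    # Hypothetical test tendency dQ/dt = -Q (exact solution exp(-t)).
    return -q


q = 1.0
dt = 0.1
for _ in range(10):
    q = matsuno_step(decay, q, dt)
print(q)  # damped toward zero, close to exp(-1)
```

Only the current values and one estimate are live at any time, which is consistent with the remark that one complete copy of the variables suffices for this method.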
[Figure 2.1.2-1, taken from Tsang (1973): the sequence of COMP1-COMP2
forward and backward time steps, with space centered and uncentered
differences, and of COMP3 and COMP4 calls in the model.]
The GISS version of the model for the IBM 370/165 takes advantage
of the fact that only one complete copy of the variables is needed for this
method to reduce the storage requirements of the model by roughly half.
The new UCLA model uses the leapfrog scheme to advance time. This
scheme computes a value for the variable A at time $t_{n+1}$ as follows:
$$\frac{A(t_{n+1}) - A(t_{n-1})}{2(t_{n+1} - t_n)} = f(A(t_n)).$$
This scheme takes half the computer time, but requires twice the space of
the Matsuno scheme, since two complete sets of the variables are required to
compute a new value. The leapfrog scheme is numerically superior to the
Matsuno scheme in that it does not amplify or damp the solution, but it is
inferior in that it tends to produce two separate and divergent solutions.
The new UCLA model will couple these two solutions by introducing one
Matsuno step for every six leapfrog steps.
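The coupling of the leapfrog scheme with a periodic Matsuno step can be sketched as follows. The sketch assumes a cycle of one Matsuno step followed by six leapfrog steps, which is one reading of the sentence above; the function names, the oscillation test problem dA/dt = iA, and the step size are hypothetical choices for illustration.

```python
import cmath


def matsuno_step(f, a, dt):
    a_star = a + dt * f(a)           # forward estimate
    return a + dt * f(a_star)        # backward correction


def leapfrog_step(f, a_prev, a_now, dt):
    # A(t_{n+1}) = A(t_{n-1}) + 2*dt*f(A(t_n)); two time levels must be kept.
    return a_prev + 2.0 * dt * f(a_now)


def integrate(f, a0, dt, n_steps, leapfrog_per_matsuno=6):
    a_prev, a_now = None, a0
    for k in range(n_steps):
        if k % (leapfrog_per_matsuno + 1) == 0:
            # Matsuno step: starts the run and re-couples the two solutions.
            a_next = matsuno_step(f, a_now, dt)
        else:
            a_next = leapfrog_step(f, a_prev, a_now, dt)
        a_prev, a_now = a_now, a_next
    return a_now


# Hypothetical test problem: a pure oscillation, exact solution exp(i*t).
result = integrate(lambda a: 1j * a, 1.0 + 0.0j, 0.01, 700)
print(abs(result - cmath.exp(7j)))   # small error after t = 7
```

Each leapfrog step evaluates f once, against twice for a Matsuno step, but it must hold both the previous and the current values, which matches the time and space trade-off stated above.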
Figure 2.1.2-1, taken from Tsang (1973), shows the sequence of
computation in the current UCLA and GISS models. Each normal time step
consists of a COMP1-COMP2 call for a forward (estimator) time step and
another COMP1-COMP2 call for a backward (corrector) step. Every six normal
steps, the effects of solar radiation and evaporation are computed by a call
on COMP3 and COMP4. The value of the variable M determines which form of
the difference algorithm will be used in the COMP1-COMP2 routines. The
following section discusses the need for the spatial difference variations.
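The calling sequence just described can be written out as a short schedule generator. This is a sketch under stated assumptions: the function run_schedule and the string labels are hypothetical, the long-wave table look-up is taken to occur on every fourth COMP3-COMP4 call, and the switching of the difference form by the variable M is not represented.

```python
def run_schedule(n_steps):
    """Return the list of calls made during n_steps normal time steps."""
    log = []
    for step in range(1, n_steps + 1):
        log.append("COMP1-COMP2 forward")    # estimator step
        log.append("COMP1-COMP2 backward")   # corrector step
        if step % 6 == 0:
            log.append("COMP3-COMP4")        # solar radiation, moisture
            if (step // 6) % 4 == 0:
                # Assumed cadence: every fourth COMP3-COMP4 cycle.
                log.append("long-wave table look-up")
    return log


log = run_schedule(24)
print(log.count("COMP3-COMP4"), log.count("long-wave table look-up"))
```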
2.1.3 Horizontal Resolution and Various Differencing Schemes
Both UCLA models and the most frequently used version of the GISS
model have 72 points around circles of latitude, and 46 circles of latitude
from pole to pole (including the poles). For the next decade GISS is
interested in models of two different sizes for the proposed computer
(Halem, 1974). Both models will have 15 vertical levels (i.e., 15 spherical
shells) and differ only in the number of points around the equator of the
model. The two sizes of interest are:
1. A model with 128 points around the equator and with 96
circles of latitude. We will call this the 96 x 128 grid.
2. A model with 256 points around the equator and with 192
circles of latitude. We will call this the 192 x 256 grid.
All of the models use a staggered grid system, which stores the
values of the primary meteorological variables at different points in space.
Figure 2.1.3-1, which is taken from (Mintz, 1974), shows five grid schemes
which have been considered. The first UCLA model and the current GISS model
use scheme B. Arakawa has decided to use scheme C in the new UCLA model.
The basis for this decision, which follows in the next paragraph, illustrates
the intricacy of the model.
Convection of moisture from the earth's surface to high altitudes,
called cumulus convection, is an important atmospheric phenomenon,
especially in the tropics. The scale of this motion is tens of kilometers;
the distance between grid points at the equator is 156 kilometers even for
the 256 point model. Arakawa found a means to parameterize cumulus cloud
convection so that its effects could be felt by the model in spite of the
fact that direct simulation - as the model does for winds, temperature and
specific humidity - is not possible. The parameterized cumulus convection
produces rising and subsiding air motion which frequently occurs in a
checkerboard pattern. To use scheme B for the grid layout, one must average
the values of pressure at the corners of each grid square to compute the
effect of pressure on the flow fields. Rising motion at one corner is
cancelled by subsidence at another, and the net effect is that the cumulus
convection goes unnoticed by the model. Arakawa devised the intricate time
and space difference scheme shown in Figure 2.1.2-1 (taken from Tsang, 1973)
to counteract this insensitivity. The differencing scheme uses a cycle of
space centered and uncentered differences to permit the checkerboard pattern
produced by cumulus convection to influence the model. When grid scheme C
is used, these elaborate gyrations are unnecessary. Primarily for this
reason, Arakawa has decided that scheme C will be used in the next UCLA
model. The current model, which is the basis for the GISS work, uses
scheme B.
[Figure 2.1.3-1 Staggered Grid Schemes. Each panel shows where u, v and h
are stored relative to the grid indices i-1, i, i+1 and j-1, j, j+1.]
u: the west to east component of the horizontal flow
v: the south to north component of the horizontal flow
h: the distance from the surface to the top of the atmosphere
in the model
2.2 GISS Modifications to the Model
Several modifications of the UCLA model were made by GISS. Only
one of these has a major impact on this research. This is the distinctly
different approach to the treatment of high latitude regions which GISS has
adopted, and which they call the split grid model.
The meridian lines on a sphere get progressively closer as they
approach the poles. The Courant stability criterion (Fox, 1961), c Δt < Δx,
where c is the highest velocity in the model, requires that a very small
time step be used to avoid numerical instability in these regions. The UCLA
approach to this problem is to smooth across progressively wider bands of
meridional lines as the meridians get closer together. The GISS approach is
to progressively reduce the number of meridians by a factor of two as one
moves from the equator to the poles. This divides the sphere into several
regions as illustrated in Figure 2.2-1. Within each region, the number of
meridians is constant. The region boundaries are chosen to keep the
inter-meridian distance roughly constant for all regions. In the split grid
model, the need for zonal smoothing is much reduced but not completely
eliminated. Table 2.2-1 shows the number of split grid regions for grids
with different numbers of points on the equator.
    Meridians at      Number of split
    the equator       grid regions
         72                  5
        128                  7
        256                 11
        512                 15
Table 2.2-1. The Number of Split Grid
Regions for Various Model Sizes
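The effect of the Courant criterion on the time step can be illustrated with a short calculation. This is a sketch only: the Earth radius and the signal speed c are assumed values introduced for the example, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean radius of the Earth
C_MAX_KM_S = 0.3           # assumed highest velocity in the model (hypothetical)


def zonal_spacing_km(n_meridians, lat_deg):
    """Distance between adjacent meridians along a circle of latitude."""
    circumference = 2.0 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))
    return circumference / n_meridians


def max_time_step_s(n_meridians, lat_deg, c=C_MAX_KM_S):
    """Largest time step allowed by c * dt < dx at the given latitude."""
    return zonal_spacing_km(n_meridians, lat_deg) / c


for lat in (0.0, 60.0, 80.0):
    print(lat, round(zonal_spacing_km(256, lat), 1), round(max_time_step_s(256, lat), 1))
```

At 60 degrees latitude the spacing is half its equatorial value, so halving the number of meridians there restores the equatorial spacing and with it the equatorial time step limit.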
The split grid model offers two advantages over the UCLA smoothing
approach. The first is that a larger time step can be used throughout the
model, since the smallest increment in the "x" direction is larger in a split
grid model. Also, there is a potential storage saving for the split grid
model. The split grid scheme does have the liability that it is more
difficult to program.
Whether a rectangular UCLA-style model or a GISS split grid model
is used, some averaging of polar values must be done. Thus, there is a
clear inherent parallelism in the processing which strongly suggests parallel
computation on circles of constant latitude.
[Figure 2.2-1 The GISS Split Grid Model]
2.3 The Effects of the Oceans on the Atmosphere
Until recently, meteorologists have assumed that the effect of the
oceans on heat transfer from the equator to the high latitude regions was
negligible. Lately, however, this view has changed, as evidenced by the
relatively large emphasis on ocean modelling at the UCLA workshop (Mintz,
1974), and by the decision of the UCLA group to couple an ocean model and
an atmospheric model in a future model. Whereas the atmospheric equations
are integrated in time by explicit numerical methods, Semptner of the UCLA
staff indicated that all known ocean models advance time by successive
over-relaxation - an implicit method (Semptner, 1974). He also feels
that IBM 360 single precision arithmetic is sufficient for solving the
system of equations for a 46 x 72 grid.
Semptner also cited work at GFDL (Manabe, 1969) which indicates
that integration of the atmospheric equations consumes 40 times the amount
of computer time as does integration of the ocean equations for the same
simulated time. This dramatic difference results from the differences in
the two fluids, and the fact that the implicit solution scheme permits
the use of a significantly larger time step than an explicit scheme would.
While it is clear that an ocean model will be required to improve
current results, it is not clear what the details of the ocean model must
be. Recent observations and numerical work (Mintz, 1974) have shown the
existence of small scale (40-50 kilometer) ocean phenomena. Whether these
are important, and if so, whether their effects can be parameterized (as
was cumulus convection in the atmospheric model) is yet to be shown. The
potential need for an ocean model coupled to the atmospheric model will be
most explicitly reflected in the size of memory that we recommend.
2.4 Input and Output Requirement of the Model
The proposed mode of operation for the new machine is that it
receive its program and initial data from the GISS IBM 360/95 by using an
IBM channel with a data handling capacity of $6 \times 10^{6}$ bits per second.
A problem thus received would be run in stand-alone fashion by the
machine with periodic dumps of model status. The current GISS model writes
an output record every two hours of simulated model time for a 46 x 72 x 9
grid. Table 2.4-1 shows the variables which constitute these records, the
sizes of the records for a 46 x 72 x 9, a 96 x 128 x 15, and a 192 x 256 x 15
grid, the lower bound on the elapsed time to write the record using the
channel at its maximum rate, and an estimate of the computing time required
for the new machine to compute two hours of model simulation.
                                         BYTES
    DATA                46 x 72 x 9   96 x 128 x 15   192 x 256 x 15
    TAU                           4               4                4
    C(300)                    1,200           1,200            1,200
    Q(NS,EW,V,4)            476,928       2,949,120       11,796,480
    P  (NS,EW)               13,248          49,152          196,608
    TS (NS,EW)               13,248          49,152          196,608
    SHS(NS,EW)               13,248          49,152          196,608
    GT (NS,EW)               13,248          49,152          196,608
    CW (NS,EW)               13,248          49,152          196,608
    Total                   544,372       3,196,084       12,780,724
    Transmission
    Time (seconds)            0.726            4.26            17.04
    Estimated
    Computation
    Time (seconds)            0.031            0.39             3.15
Table 2.4-1 Record Sizes and Transmission
Times for Various Grid Sizes
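The record sizes and transmission times of Table 2.4-1 follow from the array dimensions and the channel rate, as the following sketch shows. It assumes four-byte words, four three-dimensional fields in Q, five two-dimensional fields, and a channel rate of 6 million bits per second; the function and constant names are hypothetical.

```python
BYTES_PER_WORD = 4            # IBM 360 single precision word
CHANNEL_BITS_PER_S = 6.0e6    # assumed channel capacity


def record_bytes(ns, ew, levels):
    tau = 1 * BYTES_PER_WORD                     # TAU
    c = 300 * BYTES_PER_WORD                     # C(300)
    q = ns * ew * levels * 4 * BYTES_PER_WORD    # Q(NS,EW,V,4)
    two_d = 5 * ns * ew * BYTES_PER_WORD         # P, TS, SHS, GT, CW
    return tau + c + q + two_d


def transmission_seconds(n_bytes):
    return n_bytes * 8 / CHANNEL_BITS_PER_S


for grid in ((46, 72, 9), (96, 128, 15), (192, 256, 15)):
    size = record_bytes(*grid)
    print(grid, size, round(transmission_seconds(size), 3))
```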
As Table 2.4-1 makes clear, data output from the model will have
to come less often than every two hours of simulated time if the machine is
not to become heavily output bound. It is doubtful that channel transmission
capacity can be increased nearly enough to reduce the output time
significantly with respect to the computation time.
3. The Array Computer
A computing capability improvement by a factor of 100 over the
capability of the 360/95 is a big order. In the time span specified for the
development of this design, there is no hope of achieving this improvement
purely by increased raw hardware speed. Indeed, physical realities such as
the bound imposed by the speed of propagation of electromagnetic waves
may make this path forever impossible. Clearly, if the capability increase
can be achieved, it must be achieved by using a machine organization
different from that of the 360/95.
The approach we shall take is to organize the machine as an array
processor. Applications research (Carroll, 1967) for an early array
processor, the SOLOMON (Slotnick, 1962), has shown that the array processor
organization is ideally suited to the class of problems that the general
circulation model typifies: solution of partial differential equations on
a large grid. Indeed, the GISS general circulation model has been
successfully converted for execution on the ILLIAC IV (Slotnick, 1968), the
only operational large scale array processor.
Figure 3-1 contrasts the organization of an array processor with
that of a conventional computer. In a conventional machine, control hardware
(shown in the figure collected into one functional block and labelled
the Control Unit) interprets the instruction stream and provides signals
which control the operation of the rest of the hardware, collected into
the block called the Arithmetic Unit. In most conventional machines, both
the instructions and the data are stored in one memory. In most conventional
computers, as suggested above, the control and arithmetic - or execution -
functions are seldom as clearly separated as the figure suggests. In the
array computer, however, the control and execution functions are clearly
separated. The arithmetic unit is replicated many times (1024 in the
SOLOMON (Slotnick, 1962) and 64 in the ILLIAC IV (Slotnick, 1968)), and
the data memory is divided so that each of the arithmetic units operates
on its own data stream under the control of one common program. In a
conventional computer, conditional tests on data values in the single data
stream alter the flow of the single instruction stream. In the array
processor, residual local control in the processors of the array permits
conditional tests on data to allow individual processors to skip executing
instructions. In a standard technique for controlling iterations, the
control unit samples the activity status of the processors in the array, and
stops the iteration when all of them become inactive.
[Figure 3-1 The Basic Structures of a Classical Computer and an Array
Computer. In the classical computer a control unit and an arithmetic unit
share one memory holding instructions and data. In the array computer a
control unit, fed by the instruction stream, sends control signals to many
arithmetic units, each with its own data memory.]
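The iteration control technique described above can be imitated in ordinary sequential code. The sketch below is an illustration only: it uses Newton's iteration for a square root as a stand-in workload, and the function name array_sqrt, the tolerance, and the sample data are assumptions. Each list element plays the role of one processor with its own data and its own activity bit.

```python
def array_sqrt(values, tol=1e-12):
    """Newton iteration for square roots, run in lock step over all 'processors'."""
    x = [v if v > 1.0 else 1.0 for v in values]   # per-processor first guess
    active = [True] * len(values)                 # per-processor activity status
    sweeps = 0
    while any(active):                            # control unit samples the status
        for p in range(len(values)):
            if not active[p]:
                continue                          # inactive processor skips the instruction
            new = 0.5 * (x[p] + values[p] / x[p])
            if abs(new - x[p]) <= tol * new:
                active[p] = False                 # local test turns this processor off
            x[p] = new
        sweeps += 1
    return x, sweeps


roots, sweeps = array_sqrt([4.0, 2.0, 100.0, 0.25])
print(roots, sweeps)
```

The number of sweeps is set by the slowest processor; the others sit idle once their own test has succeeded, which is the cost of running one common instruction stream.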
Application studies reported by Kuck (Kuck, 1968) have shown that
another local control feature is a vital element in an array processor. The
ability of each processor to index a control unit supplied data address
permits much more flexible use of the processors in the array. In the
general circulation model, processor level indexing is necessary to support
the table look up process used in the radiation calculation phase of the
model.
Virtually all problems for which array processors are suited
require that the processors in the array exchange data values. In the
SOLOMON computer, the 1024 processors were arranged in a square thirty-two
processors on a side, and each processor could access the memories of its
four nearest neighbors in addition to its own. The sixty-four processors
of the ILLIAC IV are also arranged in a square, and each processor can
receive values from its nearest four neighbor processors. In the design
described in this paper, we use a separate routing network modeled after the
suggestions of Lawrie (Lawrie, 1973) which permits much more flexible
interprocessor communication. Figure 3-2 shows the design described in the
remainder of this paper in block form. The machine includes a control unit,
256 array processors and their memories, and a sixteen unit three stage
routing network.
[Figure 3-2 Block Diagram of the System: the control unit, the 256
processors, the 16 routing units and the 256 memories, with the data and
control paths between them.]
4. The System Design
The following sections will describe the system design. The initial
sections will establish the important parameters of the design. Subsequent
sections will discuss the arithmetic processor, routing network, and control
unit of the system.
4.1 System Parameters
In this group of sections, the basis for the choice of word length,
memory size, and other basic system parameters is given.
4.1.1 Word Size
The UCLA and GISS models run in single precision of the IBM System/
360 (Arakawa, 1972; Tsang, 1973). Williamson and Washington of the National
Center for Atmospheric Research (NCAR) performed precision experiments with
the NCAR model (Williamson, 1973). Normally, the CDC machines on which that
model runs operate on a forty-eight bit fraction. Through software means,
they ran twenty-four and twenty-one bit test cases, and compared the results
with a forty-eight bit control run. They concluded that "the lower-precision
arithmetic planned for the next generation of computers [that is, twenty-four
bit fractions] does not seriously affect the results from the current NCAR
[five degree, six layer] global circulation model." Dr. Larry Gates of the
Rand Corporation has recently rescinded his decision to run the Rand
modification of the UCLA model in double precision (Gates, 1975). He said
that differences between single and double precision test runs are well
within the so-called "predictability error" for hydrodynamics calculations
discussed by Lorentz (Lorentz, 1963).
On the basis of the above information, we have decided that single
precision arithmetic is sufficient for the execution of the model.
4.1.2 Word Format
The system was designed to operate in conjunction with IBM series
360 computers at GISS. Data preprocessing steps to prepare input for the
system and data post processing steps to analyze the results of experiments
will be done on the IBM equipment. Programming for the system is to be in
a FORTRAN-like higher level language, so that floating point operation is
required. Because of the cooperation required between the system and the
360, it was decided to make the floating point format of the machine the same
as that of the 360 (IBM, 1970). The floating point format for the design is
shown in Figure 4.1.2-1. A floating point word is represented in sign
magnitude form by a one bit sign, a seven bit exponent, and a twenty-four bit
fraction. A zero sign bit is used for non-negative numbers. The seven bit
exponent field contains a biased representation for exponent values between
minus sixty-four and plus sixty-three inclusive. The proper representation
for an exponent value is found by adding the value to the bias, sixty-four.
Thus, for example, an exponent field value of 41 base sixteen represents an
exponent value of plus one. The magnitude part of the number is a proper
fraction; that is, there is an implicit binary point at the left
of the most significant fraction bit. The exponent field represents the power of
sixteen which must multiply the fraction to correctly express the value of
the floating point number as a whole. Because the exponent radix is sixteen,
a change of one in the exponent value requires a shift of four bit positions
in the fraction to represent the same numerical value. Thus, the twenty-four
bit fraction can be regarded as a six hexadecimal digit fraction; each
hexadecimal digit is represented by four contiguous bits of the fraction,
and shifts of the fraction are made in multiples of four bit positions.
    | SIGN |  EXPONENT  |         FRACTION         |
       1     2        8   9                      32
Figure 4.1.2-1 The Floating Point Word Format
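The format can be made concrete with a small decoder. This is a sketch that follows the description above (one sign bit, a seven bit exponent biased by sixty-four, a twenty-four bit fraction, radix sixteen); the function name decode_hex_float and the sample words are chosen for illustration.

```python
def decode_hex_float(word):
    """Decode a 32-bit word in the sign / excess-64 exponent / 24-bit fraction format."""
    sign = (word >> 31) & 0x1
    exponent_field = (word >> 24) & 0x7F          # seven bit biased exponent
    fraction_bits = word & 0xFFFFFF               # twenty-four bit fraction
    fraction = fraction_bits / float(1 << 24)     # binary point left of the top bit
    value = fraction * 16.0 ** (exponent_field - 64)
    return -value if sign else value


print(decode_hex_float(0x41100000))   # exponent field 41 hex, fraction 1/16
print(decode_hex_float(0xC1200000))   # sign bit set
print(decode_hex_float(0x42640000))   # exponent +2, fraction 100/256
```

An exponent field of 41 base sixteen gives a multiplier of sixteen to the first power, so a fraction of one sixteenth decodes to one.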
4.1.3 Memory Requirements
Based on experience with the cost of development of special high
data rate disk systems which we obtained with ILLIAC IV, we decided that
the memory of the machine should be large enough to contain all of the
data. The memory requirement was estimated by running the COMMON for the
360/95 model through the IBM FORTRAN/H compiler. Space for four three
dimensional variables (two velocity components, salinity and temperature)
and one two dimensional variable (the vertically averaged stream function)
of an eventual ocean model was added for the 96 x 128 and 192 x 256 models.
Because the machine would have a program memory separate from its
data memory for the processor array, space for the program is not included
in the following estimates. Table 4.1.3-1 displays the amount of memory
required for several sizes of the model, including the 96 x 128 and 192 x
256 models with oceans.
                            words of memory
    NS x EW x Z           no ocean      7 level ocean
     82 x 128 x 15       1,378,411
     96 x 128 x 15       1,613,289          1,969,641
    128 x 200 x 15       3,358,601
    256 x 401 x 15      13,457,305
    164 x 256 x 15       5,506,125
    192 x 256 x 15       6,445,385          7,870,793
Table 4.1.3-1
The machine should be built with $2^{23}$ words of memory to accommodate
the 192 x 256 grid. Each of the 256 processor memories would have $2^{15}$ (or
32768) words. Each of these words will contain thirty-two information bits
and six Hamming code bits (Hamming, 1950) for detection and correction of
single bit errors. The decision to include error detection and correction
hardware was taken on the advice of the staff of the University of Illinois
Physics department. They have constructed semi-conductor memory for their
computer, and found that the error detection and correction bits which they
included were well worth while, both in terms of improved system operation
and increased maintainability (Downing, 1974).
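The memory figures above can be checked with a few lines of arithmetic. The sketch below uses the standard condition for a single error correcting Hamming code, that r check bits suffice for m data bits when 2^r >= m + r + 1; the function names are hypothetical, and the ocean model size is taken as 29 words per horizontal grid point (four three dimensional variables at seven levels plus one two dimensional variable).

```python
def words_per_processor(total_words, n_processors):
    return total_words // n_processors


def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def ocean_words(ns, ew, levels=7):
    # Four 3-D ocean variables plus one 2-D variable per horizontal point.
    return ns * ew * (4 * levels + 1)


total = 1 << 23
print(words_per_processor(total, 256))      # words in each processor memory
print(hamming_check_bits(32))               # check bits for a 32 bit word
print(6445385 + ocean_words(192, 256))      # 192 x 256 x 15 model with ocean
```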
U.l.U Measurements of the GISS Model
To discover the relative importance of multiplication and the
frequency of double precision operations in the execution of the model, the
GISS model was run for one time step on the University of Illinois' 370/158
under the control and observation of a program which computes the frequencies
of all instructions executed by the program it observes. A series of runs
was made to permit instruction counts for the important parts of the model
to be determined. Execution times for these parts of the model were
determined by the GISS staff (Kara, 1974) during a one man year effort which
produced an ILLIAC IV version of the GISS model. Table 4.1.4-1 shows the
number of instructions executed in each of three parts of the model, the
360/95 time for execution of those parts, and the instruction processing
rate of the 360/95. Table 4.1.4-2 gives the frequencies for single and
double precision floating point multiplications and divisions in the parts
of the model.

Part of the Model    Instructions    360/95 Time    360/95 Rate
Initialization         11,891,631
COMP1-COMP2            69,480,878     10.3 sec.      6.75 MIPS
COMP3                  43,505,137     6.54 sec.      6.65 MIPS

Table 4.1.4-1 Measurement Values

360 Instruction    Initialization    COMP1-COMP2        COMP3
MDR                             1        423,936      132,480
MD                            330             16      103,765
MER                           756      2,221,358      823,153
ME                          2,134      4,022,947    2,056,291
DDR                             3        105,984       33,120
DD                              1
DER                            77        359,584      615,025
DE                          1,773        440,950      929,372

Table 4.1.4-2 Instruction Counts

Approximately half of the instructions executed were floating
point instructions. These were nearly equally divided between addition and
subtraction on one hand and multiplication and division on the other. The
ratio of multiplications to divisions (weighting COMP1-COMP2 by six to
account for the more frequent use of these routines in normal model execution)
is 6.15 multiplications to one division. The vast majority of the double
precision floating point operations are performed by one assembly language
subroutine which raises a number to a constant power. This routine uses
double precision because the speed of single and double precision operations
on the IBM 360/95 is the same. An approximation formula with a few more
terms can be used without requiring any double precision.
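The quoted ratio can be reproduced from the instruction counts of Table 4.1.4-2. The sketch below assumes that the initialization counts are left out of the weighting and that each pair holds the COMP1-COMP2 count followed by the COMP3 count; the variable and function names are illustrative only.

```python
# (COMP1-COMP2 count, COMP3 count) for each multiply instruction.
multiplies = [(423_936, 132_480), (16, 103_765),
              (2_221_358, 823_153), (4_022_947, 2_056_291)]
# The three non-empty divide rows of the table.
divides = [(105_984, 33_120), (359_584, 615_025), (440_950, 929_372)]

def weighted(counts, weight=6):
    """Sum the counts, weighting COMP1-COMP2 by `weight`."""
    return sum(weight * comp12 + comp3 for comp12, comp3 in counts)

ratio = weighted(multiplies) / weighted(divides)
print(round(ratio, 2))   # 6.15
```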
On the basis of the above information, we decided to design a
single precision processor whose floating point addition and multiplication
times are comparable. Double precision operations will be performed on the
single precision hardware of the design relatively slowly since they occur
with such low frequency.
4.1.5 Processor Speed Requirements
The system is to have roughly one hundred times the processing
capability of the IBM 360/95 for the weather model. As we saw in section
4.1.4, the 360/95 executes approximately 6.7 x 10^6 operations per second
on the GISS general circulation model. We have already decided that the
machine we design will be an array processor with an architecture similar
to that of ILLIAC IV. How many processors should the machine have? To
achieve 6.7 x 10^8 operations per second, each processor of a 256 processor
machine must perform one operation in 382 nano-seconds; each processor of a
512 processor machine need only perform one operation in 764 nano-seconds.
On the other hand, as we will see in section 4.3, which discusses the
routing network, it is important to have the number of processors be a
perfect square: 256 is the square of sixteen, but 512 is not a perfect
square. Moreover, a 256 processor machine will be more reliable and have a
higher availability than a similar 512 processor machine. Therefore, we will
design a machine with 256 processors, and we would like the operation time
for a processor to be on the order of 400 nano-seconds.
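The per-processor operation times follow directly from the target rate. A minimal sketch of that arithmetic (the names are ours), assuming every processor is busy on every operation:

```python
TARGET_OPS_PER_SECOND = 6.7e8   # one hundred times the measured 360/95 rate

def operation_time_ns(processors):
    """Time each processor may spend on one operation, in nanoseconds."""
    return processors / TARGET_OPS_PER_SECOND * 1e9

print(round(operation_time_ns(256)))   # 382
print(round(operation_time_ns(512)))   # 764
```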
4.1.6 The Choice of TTL Technology for the Processor
It was clear from the outset that the time and budget constraints
on the design necessitated using an existing integrated circuit technology,
and in fact a family which is currently commercially available "off the
shelf". The choice must be either TTL, MOS, or ECL (Hnatek, 1973). A
higher level of integration (that is, more powerful individual packages) is
available in the TTL family than is available in the ECL family. Moreover,
the new Schottky variant of TTL logic is nearly as fast as ECL. The speed
of MOS logic is far slower than that of even standard TTL. A floating
point processor with a fast multiplier will surely require using several
hundred integrated circuits in its design. Fewer high level packages are
required than low level packages to achieve the same functions, and package
savings pay off in both board and interconnection savings. Therefore, we
chose to design the processor in terms of TTL integrated circuits.
Package savings in the processor design result from the use of
two different package interconnection properties of two different special
forms of TTL logic. These are discussed in the following two sections.
4.1.6.1 Open Collector Logic and the Wire AND
A standard TTL output stage is shown in Figure 4.1.6.1-1. The
advantage of the active pull-up provided by transistor Q1 is that it permits
faster operation than that of the resistor-transistor (RTL) or
diode-transistor (DTL) families from which the TTL family evolved. The
passive output stage of Figure 4.1.6.1-2 of the DTL family is used in some
of the slower of the TTL integrated circuits. Deletion of the pull-up
resistor of the passive output stage results in the so-called open collector
output. Open collector outputs of several packages can be wired together
through a common external pull-up resistor. If all of the output signals so
wired together are logic ones, each circuit will source less than one
milliamp, so the resulting current flow for the entire collection of wire
ANDed circuits results in a logic one. However, if one or more of the wire
ANDed output signals is a logic zero, the corresponding circuits will sink
on the order of forty milliamps, so that the resulting voltage level of the
ensemble falls to that of a logic zero.

[Figure 4.1.6.1-1 The Standard TTL Totem Pole Active Pull-up Output Stage]

Within the processor, the open collector outputs of the Signetics
8243 eight position scalers used in the right operand alignment shift logic
and the normalization left shift logic are wire ANDed together. An enable
signal for the device permits forcing all eight output signals to logic
ones regardless of the state of the eight input signals. One of the two
shift networks is enabled at a time, so that its output bits, ANDed with
ones of the disabled device, determine the net output of the ensemble.
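The wire AND of an enabled and a disabled device can be modelled at the logic level. The following is an illustrative sketch only: the function names are ours, and the electrical behaviour (pull-up resistor, sink current) is reduced to a bitwise AND.

```python
def scaler_outputs(data_bits, enabled):
    """Open collector outputs: the data when enabled, all ones otherwise."""
    return list(data_bits) if enabled else [1] * len(data_bits)

def wire_and(*devices):
    """A shared line is a logic one only if every device leaves it high."""
    return [int(all(bits)) for bits in zip(*devices)]

align = [1, 0, 1, 1, 0, 0, 1, 0]   # outputs of the enabled shift network
norm = [0, 1, 1, 0, 1, 0, 0, 1]    # what the disabled network would emit

bus = wire_and(scaler_outputs(align, True), scaler_outputs(norm, False))
print(bus == align)   # True
```

Because the disabled device contributes only ones, the enabled device alone determines every line.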

[Figure 4.1.6.1-2 TTL Passive Pull-up and Open Collector Output Stage]

4.1.6.2 Tri-state Logic and the Wire OR
The National Semiconductor Corporation holds the patents for
another output control technique which they refer to by the registered
trademark "tri-state" logic. Standard TTL circuits augmented by the
National technique have an enabling input which can be used to force the
outputs of the device to a high impedance state (Hnatek, 1973). The output
impedance of a standard TTL output is nominally fifty ohms. The output
impedance of a disabled tri-state output is nominally 50,000 ohms. Thus,
if several tri-state outputs are wired together and all but one of them are
disabled, the current into or out of the disabled outputs is negligible
compared to that for the one enabled output. Up to one hundred or more
tri-state outputs can be wired together on a single bus. The resulting wired
connection is usually referred to as a wired OR, and its logic state is
determined by the logic state of the enabled output.
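A logic-level sketch of a tri-state bus follows. It is an idealization under our own assumptions: a floating line is represented by None, and enabling two drivers at once is treated as an error rather than modelled electrically.

```python
def tri_state_bus(drivers):
    """Resolve one bus line from a list of (enabled, value) pairs.

    Returns the value of the single enabled driver, or None when every
    driver is in the high impedance state and the line floats.
    """
    active = [value for enabled, value in drivers if enabled]
    if len(active) > 1:
        raise ValueError("bus contention: more than one driver enabled")
    return active[0] if active else None

print(tri_state_bus([(False, 1), (True, 0), (False, 1)]))   # 0
```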
The processor design makes extensive use of tri-state devices
to reduce the need for selectors between otherwise competing signals.
4.2 The Processor Design
A simplified block diagram of the processor is shown in Figure 4.2-1.
The names in the blocks of this figure (with the exception of the 2/1
Selector blocks) are the names of the Figure or Figures which present the
logic of that block in more detail. Each of these blocks is described in
detail in the following sections.
Multiplication is performed by logic external to that shown in
Figure 4.2-1. The two twenty-four bit operands to be multiplied are sent to
the multiplier as shown, and both the most and least significant halves of
the product are returned. See section 4.2.5.2.4 and (Stenzel, 1975) for a
detailed description of the multiplier.
The processor as a whole is a large combinatorial circuit which is
conditioned by control signals from the control unit. It operates in steps
governed by one clock pulse. A typical cycle begins with operand selection.
Two operands, one of which may come from memory, flow through the paths in the
logic selected by the set of control signals. At the completion of a cycle,
result values are clocked into the registers specified by the set of control
signals.

[Figure 4.2-1 Block Diagram of the Processor]

In any logic design, options are available at many stages. The
rules governing the choice among options in this design can be qualitatively
stated as follows: minimize cost and package count, but not at the expense
of time in the critical path. Cost is reflected not only in the direct cost
of the packages, but also by the amount of board area (and hence the number
of boards) which the packages occupy. Minimizing the number of boards can
lower overall cost by reducing the need for backplane wiring or mother boards
and eliminating the need for inter-board connections. The board area for a
package was assumed to be proportional to the number of pins which the package
has. Although this assumption is not strictly true, it serves well as an
operating rule of thumb when making design choices.
4.2.1 Conventions Used in the Figures Which Describe Logic
Designing computer hardware in terms of existing integrated circuit
packages differs from computer design in terms of discrete components. In
many cases, the designer working with integrated circuits finds that no
existing package exactly suits the need of the moment. What he must then do
is make the best compromise he can with the packages which are available,
according to the general guidelines which he has adopted.
The simplest example of the above general comment is that it often
happens that an N-input gate of some type is needed. A concrete example in
this design is that a four input OR gate is needed by the logical demands of
the function to be implemented. What are available are two input OR gates
and two, four, and five input NOR gates. Among these gates, only one, the
five input NOR gate, is available in Schottky form. When the desired logic
function is in a time-critical path, the highest speed element should be used.
Hence, one finds himself using a five input gate for a four input function.
Many instances of such use occur in this design. When they occur in the
figures, only the number of inputs which are required for the logic function
being implemented are shown. The extra leads which may exist are assumed to
be connected to sources of logic ones or zeros as necessary. For example,
the extra input of the above five input NOR gate would have to be connected
to a constant logic zero source to guarantee the correct operation of the
logic in which it is used.
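The effect of tying off the spare input can be checked exhaustively. This is a small illustrative sketch; the function names are ours and the gate is modelled purely as a Boolean function.

```python
from itertools import product

def nor5(a, b, c, d, e):
    """Five input NOR gate."""
    return int(not (a or b or c or d or e))

def nor4_from_nor5(a, b, c, d, spare=0):
    """Four input NOR built from a five input gate with one spare lead."""
    return nor5(a, b, c, d, spare)

inputs = list(product((0, 1), repeat=4))

# Spare lead tied to logic zero: the gate behaves as a four input NOR.
print(all(nor4_from_nor5(*bits) == int(not any(bits)) for bits in inputs))  # True

# Spare lead tied to logic one: the output is stuck at zero.
print({nor4_from_nor5(*bits, spare=1) for bits in inputs})                  # {0}
```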
Detailed documentation for the integrated circuits used in this
design can be found in four industry data books. In the description which
follows, the notation given in Table 4.2.1-1 is used for naming components.

Form of the Name    Source for Detailed Information
SN74xxxx            The TTL Data Book for Design Engineers, First Edition,
                    Document Number CC-411, Texas Instruments Incorporated,
                    1973.
                    Supplement to the TTL Data Book for Design Engineers,
                    First Edition, Document Number CC-416, Texas
                    Instruments Incorporated, 1974.
SIGxxxx             Signetics Digital, Linear, MOS Data Book, Signetics
                    Corporation, 1974.
AMxxxx              Advanced Micro Devices Data Book, Advanced Micro
                    Devices Incorporated, 1974.
NATxxxx             Digital Integrated Circuits, National Semiconductor
                    Corporation, 1974.

Table 4.2.1-1 The Notation for Package Names
in the Logic Design Figures

4.2.2 Signal Name Notation Used in the Design Description
In the description of the design in the following sections, signals
will be named by an identifier of eight or fewer capital letters and digits.
The first character of a signal name will be a letter. Multi-bit signals are
named by a single identifier to which bit specifications are appended. A
bit specification is a list of up to three integers separated by commas and
enclosed in parentheses. The bits of multi-bit signals are numbered from one
for the most significant to N for the least significant bit of an N bit
signal. A bit specification which consists of a single integer specifies the
single bit of the multi-bit signal with that integer as its bit number. In a
bit specification with two integers, the first specifies the bit number of
the most significant bit of the signal and the second specifies the number of
contiguous bits in the signal. The third integer of a three integer bit
specification is the difference between successive bit numbers in the
specified signal. Table 4.2.2-1 gives several examples of signal names.

Signal Name    Meaning
A              the one bit signal A
B(3)           bit three of the multi-bit signal B
B(1,32)        bits one through thirty-two of the multi-bit signal B
B(5,4)         bits five through eight of the multi-bit signal B
C(1,2,4)       bits one and five of the multi-bit signal C

Table 4.2.2-1 Several Examples of Signal Names

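The bit specification rules can be made concrete with a small parser. This is a sketch under our own assumptions: the function name `expand` and the choice to return an empty list for a one bit signal are ours, not part of the notation.

```python
import re

SIGNAL = re.compile(r"([A-Z][A-Z0-9]{0,7})(?:\((\d+(?:,\d+){0,2})\))?")

def expand(signal):
    """Return (name, [bit numbers]) for a signal name such as 'B(5,4)'."""
    match = SIGNAL.fullmatch(signal)
    if match is None:
        raise ValueError(f"not a valid signal name: {signal!r}")
    name, spec = match.groups()
    if spec is None:
        return name, []               # a one bit signal has no bit numbers
    # Missing integers default to a count of one and a step of one.
    first, count, step = (list(map(int, spec.split(","))) + [1, 1])[:3]
    return name, [first + i * step for i in range(count)]

print(expand("B(3)"))       # ('B', [3])
print(expand("B(5,4)"))     # ('B', [5, 6, 7, 8])
print(expand("C(1,2,4)"))   # ('C', [1, 5])
```

With this reading, a name such as A(4,8,4) denotes the fourth bit of each of the eight four bit digits of a thirty-two bit fraction.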
This notation for signal names is used consistently throughout the text and
figures which describe the design. It is also used for signal names in the
input language for the logic simulation package described in section 5.1. In
the truth tables which follow, a lower case "x" signifies that the package
described by the truth table operates correctly for any value of the signal
represented by the "x".
4.2.3 Inversion in the Logic Figures
When the function of an integrated circuit includes the logical
complement of the inputs, this is shown by a small circle external to the
rectangle which represents the integrated circuit. The alignment shift blocks
of Figure 4.2-1 are an example of an inverting block.
4.2.4 Detailed Description of Two Packages
Two packages, the Texas Instruments SN74S157 and the Signetics 8263,
are described in detail in this section. Two reasons motivate these detailed
descriptions. First, these packages are typical of most of the integrated
circuits which are used in this design. Second, and perhaps more important,
these particular packages perform critical functions in the design. All of
their features are exercised, so that a full understanding of the design is
impossible without a full understanding of these two packages.
4.2.4.1 The Texas Instruments SN74S157
The Texas Instruments SN74S157 is a quadruple two-to-one selector.
It accepts two four bit input operands and a one bit selection signal and
produces a four bit output. The output is the four bit input designated by
the selection signal. There is one more input, however. A one bit strobe
signal can be used to force the outputs to zeros without regard to the input
signals. There are several occasions in the design where the strobe signal is
used to good advantage. The truth table for the SN74S157 is given in
Table 4.2.4.1-1.

                 Inputs
      Data            Selection    Strobe    Output
  x         x             x           1       0000
A(1,4)      x             0           0       A(1,4)
  x       B(1,4)          1           0       B(1,4)

Table 4.2.4.1-1 The Truth Table for the
Texas Instruments SN74S157

4.2.4.2 The Signetics 8263
The Signetics 8263 is a quadruple three-to-one selector. It accepts
three four bit input operands, a two bit selection signal, and a one bit com-
plement signal, and produces a four bit output. The output is the four bit
input designated by the selection signal. The two bit selection signal can
specify one of four input signals; the fourth state is used to set the output
to zero without regard to any of the input signals. The complement signal
can be used to specify that the output is to be the logical complement of the
selected input. The truth table for the Signetics 8263 is given in
Table 4.2.4.2-1.

                     Inputs
           Data               Selection  Complement  Output
  x        x        x            00          0       0000
A(1,4)     x        x            01          0       A(1,4)
  x      B(1,4)     x            10          0       B(1,4)
  x        x      C(1,4)         11          0       C(1,4)
  x        x        x            00          1       1111
A(1,4)     x        x            01          1       complement of A(1,4)
  x      B(1,4)     x            10          1       complement of B(1,4)
  x        x      C(1,4)         11          1       complement of C(1,4)

Table 4.2.4.2-1 The Truth Table for the Signetics 8263

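The behaviour of the two selectors can be summarized as two short functions. This is a behavioural sketch only: the function names are ours, four bit operands are represented as integers from 0 to 15, and the selection codes follow the truth tables of this section (selection 0 chooses A on the SN74S157; selection 00 forces zeros on the 8263).

```python
def sn74s157(a, b, select, strobe):
    """Quadruple two-to-one selector with a strobe that forces zeros."""
    if strobe:
        return 0b0000
    return b if select else a

def sig8263(a, b, c, select, complement):
    """Quadruple three-to-one selector with optional output complement.

    select is the two bit selection code as an integer 0..3.
    """
    chosen = (0b0000, a, b, c)[select]
    return chosen ^ 0b1111 if complement else chosen

print(sn74s157(0b1010, 0b0101, select=1, strobe=0))            # 5
print(sig8263(0b1010, 0b0101, 0b0011, select=1, complement=1)) # 5
```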
4.2.5 The Processor Design
In the two sections which follow, the design of the processor is
completely described. The first of these sections describes functional logic
blocks in their own right without regard to the contributions which those
blocks make in the operation of the processor. The second section describes
how the processor performs normalization, rounding, floating point addition/
subtraction, floating point double precision addition/subtraction, floating
point multiplication, and finally floating point division. This section
relies on an understanding of the former sections describing the various logic
blocks. It describes the control logic which is necessary to integrate the
operation of those logic blocks to perform the desired operations.
4.2.5.1 Logic Blocks
The following sections describe several logic elements which perform
definite functions in support of larger operations in the processor.
4.2.5.1.1 The Zero Detect Logic
A zero detect logic block produces the logical OR of thirty-two bits.
Three instances of the zero detect block occur. In all three cases, the
thirty-two input bits constitute a thirty-two bit operand fraction. Figure
4.2.5.1.1-1 depicts the zero detect logic. The packages used are four SN74S260
dual five-input positive NOR gates and one SN74S133 thirteen-input positive
NAND gate. Each of the NOR gates is used to produce the NOR of four input
fraction bits. The eight results are combined by the NAND gate to yield the
desired OR of the thirty-two input bits.
In Figure 4.2.5.1.1-1, the four bit groups shown as inputs to the
NOR gates represent four bit digits of a fraction. In only one of the three
instances of the zero detect logic is this rigid connection scheme necessary.
(See section 4.2.5.2.1 Normalization.) In the other two cases, the total of
forty NOR gate inputs can be connected in whatever manner is convenient for
circuit board routing purposes.

[Figure 4.2.5.1.1-1 The Zero Detect Logic: the digits X(1,4) through X(29,4) feed SN74S260 NOR gates whose outputs are combined by one SN74S133 NAND gate]

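The NOR and NAND arrangement of the zero detect logic can be sketched as follows. The function names are ours, and the way the spare gate inputs are tied off (low for the NOR gates, high for the NAND gate) is our assumption, chosen so that the spares do not affect the result.

```python
def nor(*inputs):
    return int(not any(inputs))

def nand(*inputs):
    return int(not all(inputs))

def zero_detect(fraction_bits):
    """Return the OR of thirty-two fraction bits (1 if any bit is a one)."""
    if len(fraction_bits) != 32:
        raise ValueError("expected thirty-two bits")
    digits = [fraction_bits[i:i + 4] for i in range(0, 32, 4)]
    # Each five input NOR sees one four bit digit; the spare input is low.
    digit_is_zero = [nor(*digit, 0) for digit in digits]
    # The thirteen input NAND uses eight inputs; the five spares are high.
    return nand(*digit_is_zero, 1, 1, 1, 1, 1)

print(zero_detect([0] * 32))              # 0
print(zero_detect([0] * 31 + [1]))        # 1
```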
4.2.5.1.2 The Fraction Comparator
This logic block is built entirely with the SN74S85 four bit
comparator. This integrated circuit accepts a pair of four bit operands and
three signals which permit fabrication of multi-bit comparators and produces
three one bit output signals. Figure 4.2.5.1.2-1 shows one SN74S85, and
illustrates how it is used in this design. Table 4.2.5.1.2-1 is the truth
table for the SN74S85. Figure 4.2.5.1.2-2 shows how eight SN74S85's are
used to compare two thirty-two bit fraction values. The output signal AGTR is
a logic one if and only if the A(1,32) input signal exceeds the B(1,32) input
signal. The ABEQ signal is a logic one if and only if the input signal values
are identically equal.
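A thirty-two bit comparison can be assembled from four bit stages as sketched below. This is a simplified model under our own assumptions: the stages are chained serially from the least significant digit upward, which need not match the wiring of Figure 4.2.5.1.2-2, and only one-hot cascade values are ever passed between stages.

```python
def compare4(a, b, cascade=(1, 0, 0)):
    """One four bit stage; tuples are ordered (A = B, A < B, A > B)."""
    if a > b:
        return (0, 0, 1)
    if a < b:
        return (0, 1, 0)
    return cascade      # equal digits pass on the less significant result

def compare_fractions(a, b):
    """Compare two thirty-two bit fractions; returns (ABEQ, AGTR)."""
    result = (1, 0, 0)
    for shift in range(0, 32, 4):            # least significant digit first
        result = compare4((a >> shift) & 0xF, (b >> shift) & 0xF, result)
    abeq, _, agtr = result
    return abeq, agtr

print(compare_fractions(0x80000000, 0x7FFFFFFF))   # (0, 1)
print(compare_fractions(0x12345678, 0x12345678))   # (1, 0)
```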
4.2.5.1.3 The Exponent Adder
The exponent adder, shown in Figure 4.2.5.1.3-1, accepts two eight
bit exponent quantities, AEXP(1,8) and BEXP(1,8), one three bit function
specification, ABFUNC(1,3), and a one bit input carry signal, EXCARRY. The
two eight bit exponent inputs consist of a zero bit as most significant bit,
followed by the seven bits of the biased exponent for the two operands.
The exponent adder produces the eight bit combination of the two
input exponents, EXC1(1,8), as specified by the function, ABFUNC(1,3), the
absolute value of the difference of the two input exponents, ABS(1,7), and
two one bit control signals, EXC2 and EXC2BAR.

[Figure 4.2.5.1.2-1 The SN74S85 Four Bit Comparator]

Relation of the         Cascading Inputs           Outputs
4 bit data inputs    A = B  A < B  A > B     A = B  A < B  A > B
A > B                  x      x      x         0      0      1
A < B                  x      x      x         0      1      0
A = B                  1      x      x         1      0      0
A = B                  0      1      0         0      1      0
A = B                  0      0      1         0      0      1
A = B                  0      1      1         0      0      0
A = B                  0      0      0         0      1      1

Table 4.2.5.1.2-1 The Truth Table of the SN74S85 Four Bit Comparator

[Figure 4.2.5.1.2-2 The Fraction Comparator: inputs A(1,32) and B(1,32), outputs ABEQ and AGTR]

[Figure 4.2.5.1.3-1 The Exponent Adder: inputs AEXP(1,8), BEXP(1,8), ABFUNC(1,3) and EXCARRY; outputs EXC1(1,8), ABS(1,7), EXC2 and EXC2BAR]

The main functional component of the exponent adder is the SN74S381
arithmetic-logic unit. The functions performed by the SN74S381, together with
the function codes which specify them, are shown in Table 4.2.5.1.3-1 (Texas
Instruments Corporation, 1974). The SN74S381 does not produce an output carry
signal. Instead, it produces the standard pair of carry look ahead signals
for the two four bit operands. One of these signals indicates whether the
two input operands will generate a carry; the other signal indicates whether
an input carry of one will be propagated (Ledley, 1960). The generate and
propagate signals must be used in conjunction with a carry generator such as
the SN74S182 (Texas Instruments Corporation, 1973).
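The role of the generate and propagate signals can be illustrated with a two slice, eight bit adder. This sketch is an idealization: the names are ours, all signals are treated as active high, and "propagate" is taken to mean that a carry in of one would produce a carry out of the group.

```python
def generate_propagate(a, b):
    """Group generate and propagate for two four bit operands."""
    generate = int(a + b > 0xF)      # a carry out even with no carry in
    propagate = int(a + b >= 0xF)    # a carry out when the carry in is one
    return generate, propagate

def add8(a, b, carry_in):
    """Eight bit add from two four bit slices with look ahead carries."""
    a_lo, a_hi, b_lo, b_hi = a & 0xF, a >> 4, b & 0xF, b >> 4
    g0, p0 = generate_propagate(a_lo, b_lo)
    g1, p1 = generate_propagate(a_hi, b_hi)
    # Both carries depend only on G, P and the carry in, not on each other.
    c4 = g0 | (p0 & carry_in)
    c8 = g1 | (p1 & g0) | (p1 & p0 & carry_in)
    low = (a_lo + b_lo + carry_in) & 0xF
    high = (a_hi + b_hi + c4) & 0xF
    return (high << 4) | low, c8

print(add8(0x7F, 0x01, 0))   # (128, 0)
print(add8(0xFF, 0x00, 1))   # (0, 1)
```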
The exponent adder actually consists of two eight bit adders
working in parallel. The one shown at the top of Figure 4.2.5.1.3-1 always
computes the difference A(1,8) - B(1,8). The lower adder computes the
function specified by the control unit signals ABFUNC(1,3) and EXCARRY. When
ABFUNC(1,3)=010, and EXCARRY=1, ABS(1,7) is the absolute value of the exponent
difference and EXC2 and EXC2BAR have the meanings given in Table 4.2.5.1.3-2.
The absolute value is computed by computing both A(1,8) - B(1,8) and B(1,8) -
A(1,8), and selecting the positive result with the pair of SN74S157 two-to-
one selectors by using EXC2BAR as the selection signal.
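Forming both differences and keeping the positive one can be sketched in a few lines. This is an illustration only: the function name is ours, the differences are modelled in eight bit two's complement, and the returned selection flag stands in for the hardware selection signal without claiming its polarity.

```python
def exponent_difference(aexp, bexp):
    """Absolute difference of two seven bit biased exponents (0..127).

    Returns (absolute difference, flag), where the flag is 1 when the
    first exponent is the smaller and the B - A result is the one kept.
    """
    a_minus_b = (aexp - bexp) & 0xFF
    b_minus_a = (bexp - aexp) & 0xFF
    a_is_smaller = a_minus_b >> 7          # sign bit of A - B
    chosen = b_minus_a if a_is_smaller else a_minus_b
    return chosen & 0x7F, a_is_smaller

print(exponent_difference(70, 64))   # (6, 0)
print(exponent_difference(64, 70))   # (6, 1)
```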

       Inputs
A(1,4)    B(1,4)    Function    Carry    Output
  x         x         000         x      0000
A(1,4)    B(1,4)      001         0      B(1,4) - A(1,4) - 1
A(1,4)    B(1,4)      001         1      B(1,4) - A(1,4)
A(1,4)    B(1,4)      010         0      A(1,4) - B(1,4) - 1
A(1,4)    B(1,4)      010         1      A(1,4) - B(1,4)
A(1,4)    B(1,4)      011         0      A(1,4) + B(1,4)
A(1,4)    B(1,4)      011         1      A(1,4) + B(1,4) + 1
A(1,4)    B(1,4)      100         x      A(1,4) ⊕ B(1,4)
A(1,4)    B(1,4)      101         x      A(1,4) OR B(1,4)
A(1,4)    B(1,4)      110         x      A(1,4) AND B(1,4)
  x         x         111         x      1111

Table 4.2.5.1.3-1 Functions of the SN74S381 with
Active High Carry and Data

Signal     Value    Meaning
EXC2         0      A(1,8) > B(1,8)
             1      A(1,8) < B(1,8)
EXC2BAR      0      A(1,8) < B(1,8)
             1      A(1,8) > B(1,8)

Table 4.2.5.1.3-2 The Meanings of EXC2 and EXC2BAR

4.2.5.1.4 Shifting
Fraction alignment shifting and the normalization shifting are both
accomplished by using the Signetics 8243 eight bit position scaler (Signetics
Corporation, 1974, pp. 3.28 through 3.32). This device has open collector
outputs so that several can be wire ANDed together. The shifted output bits
are the logic complements of their corresponding input bits. When disabled,
the device emits logic ones. Output bits which, because of the specified
shift, have no corresponding input bits are also logic ones.
Because the exponent base of the floating point system used in this
design is sixteen, alignment and normalization shifting always require a shift
by a multiple of four bit positions. The alignment shift logic, Figure
4.2.5.1.4-1, and the normalization shift logic, Figure 4.2.5.1.4-2, can
therefore be implemented by using only four SIG8243's each. Each of the
scalers accepts one bit from the same position within each of the eight digits
of the thirty-two bit fraction to be shifted. The shift amount for each is the
number of digit positions to shift.
Although the SIG8243 has both an enable and an inhibit input to
control the output state, this design uses only the inhibit signal. When
the inhibit signal is a logic one, the output bits are all logic ones.
Disabled outputs are used to provide zero operands when the shift amount
exceeds seven, and also for several other cases in the design where zero
operands are needed. The details of alignment shift control are given in
section 4.2.5.2.3 which discusses floating point addition and subtraction.
Normalization shift control is discussed in section 4.2.5.2.7 on double
precision addition and subtraction. When the inhibit signal is a logic
zero, shifting of the input bits takes place as specified by the three bit
shift select signal.
The device performs shifts in only one direction. Both left and
right shifts can be implemented by proper use of the scaler as shown in
Figure 4.2.5.1.4-1 and Figure 4.2.5.1.4-2 by altering the orientation of the
device with respect to the most significant bit of the input signal.
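The digit shifting scheme can be sketched with four bit-slice scalers. This is a behavioural model under our own assumptions: the function names are ours, a fraction is a list of thirty-two bits with index 0 holding bit 1 (the most significant), and reversing the bit order stands in for reorienting the device to shift the other way.

```python
def scaler(bits, shift, inhibit=0):
    """Eight position scaler: complemented outputs, ones where nothing lands."""
    if inhibit:
        return [1] * 8
    out = [1] * 8
    for position in range(shift, 8):
        out[position] = 1 - bits[position - shift]
    return out

def digit_shift(fraction, digits, inhibit=0):
    """Shift right by whole hexadecimal digits; result is complemented."""
    # Slice k collects bit k+1 of each of the eight four bit digits.
    slices = [scaler(fraction[k::4], digits, inhibit) for k in range(4)]
    return [slices[i % 4][i // 4] for i in range(32)]

def to_bits(value):
    return [(value >> (31 - i)) & 1 for i in range(32)]

def from_bits(bits):
    return sum(bit << (31 - i) for i, bit in enumerate(bits))

right = digit_shift(to_bits(0x12345678), 2)
print(hex(from_bits(right)))    # 0xffedcba9

# Reversing the bit order turns the same network into a left shifter.
left = digit_shift(to_bits(0x12345678)[::-1], 2)[::-1]
print(hex(from_bits(left)))     # 0xcba987ff
```

In both cases the vacated positions hold ones, which represent zeros once the data is read as active low.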

[Figure 4.2.5.1.4-1 The Alignment Shifter: four SIG8243 scalers with inputs A(1,8,4) through A(4,8,4), shift select ASHIFT(1,3), and outputs LEFT(1,8,4) through LEFT(4,8,4)]

[Figure 4.2.5.1.4-2 The Normalization Shifter: four SIG8243 scalers with inputs B(1,8,4) through B(4,8,4), shift select NSHIFT(1,3), and outputs NORM(1,8,4) through NORM(4,8,4)]

4.2.5.1.5 The Left Operand Selection Logic
The left operand selector logic block supplies the left operand to
the adder. Two different integrated circuits are used in the left operand
selector: the SN74S157 quadruple two-to-one data selector and the SN74S153
dual four-to-one data selector. For clarity of description, the blocks in
Figure 4.2.5.1.5-1 do not correspond to the above integrated circuit packages,
but rather to the selection functions they perform. They are labelled S157
for the two-to-one function, and S153 for the four-to-one function. Whereas
the SN74S153 operates on pairs of four bits, the S153 at the bottom of the
figure is shown operating on a single four bit group; the S153 next to the
bottom operates on ten four bit groups.
The left operand selector supplies six different operands. They are
1. the fraction output of the left alignment shift logic
2. the twelve high order bits of the first approximation to the
   reciprocal for division. The other twenty bits of the fraction
   are forced to one by disabling the left alignment shift logic.
   As noted above, the alignment shift logic produces complemented
   outputs, so that the adder operates on active low data. Thus,
   the ROM which supplies the initial reciprocal approximation
   must be programmed to supply active low data also.
3. the constant fraction one-half (in active low data form) for
   use in the division algorithm. The high order bit, LEFT(1), is
   forced to zero by the bottom S153 of Figure 4.2.5.1.5-1, and
   the other thirty-one bits are forced to one by a disabled
   alignment shift network.
4. the constant fraction 2^-12 for use in the division algorithm.
   The bit LEFT(12) is forced to zero by the corresponding S153,
   and the other thirty-one bits are forced to one by a disabled
   alignment shift network.
5. a value for rounding data values to memory length (twenty-four
   fraction bits). All bits of this constant are ones from a
   disabled alignment shift network, except for LEFT(24), which
   is equal to bit twenty-five of the fraction being rounded.
6. the twenty-four least significant bits of a product. The
   adder normally operates on active low data, and a logic
   complement follows the adder. A product returns in active high
   data form. If the least significant part of the product is sought,
   it is complemented by the adder by using the exclusive OR
   function with ones from the disabled right alignment shift
   logic.

[Figure 4.2.5.1.5-1 The Left Operand Selection Logic]

Since the logic for the left operand selector requires the S153
function on a total of thirteen bits and the S157 function on nineteen bits,
seven SN74S153 and five SN74S157 integrated circuits are required to
implement it. No control local to the processor is necessary for its operation.
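The active low constants and the package counts above can be checked numerically. This is a sketch with names of our own choosing; bits are numbered from 1 (most significant) to 32, and an active low word is read by complementing it and treating the result as a binary fraction.

```python
from math import ceil

def active_low_constant(zero_bit):
    """Thirty-two ones with the bit numbered `zero_bit` forced to zero."""
    return 0xFFFFFFFF & ~(1 << (32 - zero_bit))

def fraction_value(active_low_word):
    """Value of the fraction which an active low word represents."""
    return ((~active_low_word) & 0xFFFFFFFF) / 2 ** 32

print(fraction_value(active_low_constant(1)))               # 0.5
print(fraction_value(active_low_constant(12)) == 2 ** -12)  # True

# Two S153 functions per SN74S153 package, four S157 functions per SN74S157.
print(ceil(13 / 2), ceil(19 / 4))                           # 7 5
```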
4.2.5.1.6 The Adder
The adder, shown in Figure 4.2.5.1.6-1, accepts two thirty-two bit
fractions, LEFT(1,32) and RIGHT(1,32), a function specification, AFUNC(1,24),
and an input carry AC. It produces a thirty-two bit output, SUM(1,32), which
depends on the input operands, the carry, and the function specification. It
is built from the SN74S381 arithmetic-logic unit and the SN74S182 look-ahead
carry generator.

[Figure 4.2.5.1.6-1 The Adder: eight SN74S381 slices operating on LEFT(1,32) and RIGHT(1,32) with input carry AC, SN74S182 carry generators, and outputs SUM(1,32) and ACOUT]

[Figure 4.2.5.1.6-2 The Logic for the Signal AFUNC(1,24)]

[Figure 4.2.5.1.6-3 The Logic for the Signal AFUNC(1,3): wire-OR of CUAFUNC, AFUNC1, DFUNC and IFUNC]

Except in the case of the integerize function, which is described
in section 4.2.5.2.6, each SN74S381 performs the same function, so that
AFUNC(1,3)=AFUNC(4,3)=...=AFUNC(22,3). The functions which can be specified
are listed in Table 4.2.5.1.3-1.
The output of the adder is the thirty-two bit result, SUM(1,32), and
the carry out, ACOUT. The function input to the SN74S381's is the result of a
wire-OR of four separate tri-state sources. Figures 4.2.5.1.6-2 and
4.2.5.1.6-3 show successively more detail about these wire-ORed signals.
Figure 4.2.5.1.6-2 shows eight wire-OR's, each of which produces a three
bit function specification. Each of these three bit wire-OR's actually
consists of three separate wire-OR's like the three shown in Figure 4.2.5.1.6-3.
The details of the signals AFUNC1(1,3), IFUNC(1,8), and CUAFUNC(1,3) will be
given in sections 4.2.5.2.1 through 4.2.5.2.6.
4.2.5.1-7 Fraction Selection Logic
The adder operates on active low data primarily because the Signetics
8243 eight position scaler, which is used to perform alignment and normaliza-
tion shifting, has complemented outputs. Therefore, besides selecting one of
five possible fraction sources, the fraction selection logic also performs a
logical complement. The logic is shown in Figure 4.2.5.1.7-1, and consists
of Signetics 8263 quadruple three-to-one selectors and Advanced Micro Devices
AM9309 dual four-to-one selectors. The SIG8263's were used where possible to
reduce the package count, and the AM9309's were used because no other four-to-
one selector which provides complemented outputs is available.
Figure 4.2.5.1.7-1 The Fraction Selection Logic
The five signals which the fraction selection logic accepts as
input are:
1. The unmodified output of the adder, SUM(1,32).
2. The output of the adder shifted right one digit position (four
bit positions) by appropriate selection. The control for deciding
between this input and input (1) above depends on
whether fraction overflow occurs during fraction addition. The
details of this control are given in section 4.2.5.2.3. If the
shifted input is selected, the high order digit is forced to
1110, complemented to 0001.
3. The fraction output from the routing logic reassembly register,
FROUTE(1,32). The routing logic is the subject of section
4.3.
4. The outputs of the mode flip-flop of section 4.2.5.1.9 and
five condition flip-flops (MODE, C, Z, SIGN, O, U) which are
described in section 4.2.5.1.12, and the output of the status
register of the mode logic, STATUS(1,8), which is described in
section 4.2.5.1.9. These thirteen bits are supplemented by
nineteen bits of ones (complemented to zeros) forced from the
SN74S381 arithmetic-logic units (see Table 4.2.5.1.3-1).
5. The special fraction overflow shift of one bit position which
uses the high order digit value of 0111, complemented to 1000.
This case is fully discussed in section 4.2.5.2.5.
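The choice between inputs (1) and (2) can be sketched in Python. This is a minimal model rather than the selector wiring: it assumes the adder output is held as a 32-bit active-low integer, and the name `select_fraction` and its `overflow` argument are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def select_fraction(sum_active_low: int, overflow: bool) -> int:
    """Return the active-high fraction chosen from the adder output.

    sum_active_low is the 32-bit adder result in active-low form
    (an assumption of this sketch).
    """
    if overflow:
        # Input (2): shift right one hexadecimal digit and force the
        # high order digit to 1110.
        selected = (0xE << 28) | ((sum_active_low & MASK32) >> 4)
    else:
        # Input (1): the unmodified adder output.
        selected = sum_active_low & MASK32
    # The selectors complement their input, so 1110 becomes 0001.
    return ~selected & MASK32
```

For a true sum of 1.10000000 (hexadecimal, with the leading 1 carried out of the fraction), the low 32 bits in active-low form are EFFFFFFF, and the overflow path yields the fraction 11000000.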
Figure 4.2.5.1.7-2 The Relationship of the Fraction Selector to the Rest of the Processor
Figure 4.2.5.1.7-3 A Faster Alternative to the Fraction Selection Logic
(SIG8263: 26 nsec from data to output, 36 nsec from select to output;
AM9309: 24 nsec from data to output, 32 nsec from select to output;
SN74S153 with SN74S04: 14 nsec from data to output, 20 nsec from select to output)
As shown in Figure 4.2.5.1.7-2, the fraction selection logic is in
every path which leads to the operand registers. Therefore, one would like
it to be as fast as possible. Unfortunately, neither the SIG8263 nor the
AM9309 is available in Schottky form. Figure 4.2.5.1.7-3 shows how the
thirteen package logic of Figure 4.2.5.1.7-1 could be replaced by twenty-two
packages: sixteen SN74S153 dual non-complementing four-to-one selectors and
six SN74S04 inverters. The gain in time is twelve nanoseconds per operation
when the timing depends on the data arrival time at the selectors, and sixteen
nanoseconds when the timing depends on the arrival time of the selection
signals.
4.2.5.1.8 Exponent Correction Adder
The exponent produced by the exponent adder is not correct in all
cases. When fraction overflow occurs, the fraction is shifted right one
digit position and the exponent must be increased by one. This case and
several others discussed in section 4.2.5.2.5 are handled by the exponent
correction adder.
The logic for the exponent correction adder is shown in Figure
4.2.5.1.8-1. It includes two SIG8263 three-to-one selectors which are used
to select either the exponent of the left operand, AEXP(1,7), the exponent of
the right operand, BEXP(1,7), or the result exponent from the exponent adder,
EX1(2,7). Bit EXC1(2) is complemented because it is the bias bit in the
biased exponent. When an exponent sum or difference is computed by the
exponent adder, the bias bit must be complemented in order for the resulting
exponent value to be correctly represented. (See section 4.2.5.1.12.4 or
section 4.2.5.1.12.5 for more details.) The logic which produces the selection
signal for this selection is shown in Figure 4.2.5.1.8-2. The SN74S151
eight-to-one selector is controlled according to the truth table in
Table 4.2.5.1.8-2; its inputs are wired to the logic constants indicated by
Table 4.2.5.1.8-1. EXP1, EXP2, and EX3TO1(1) are control signals from the
control unit.
Figure 4.2.5.1.8-1 The Exponent Correction Adder
Figure 4.2.5.1.8-2 The Control Signal for Input Selection
for the Exponent Correction Adder
EXC2BAR AZERO BZERO EXSEL
OUTPUT
1
1
1 1
1 1 1
1
1 1
1 1 1
1 1 1
Table 4.2.5.1.8-1 The Low-order Bit of Exponent
Selection Control
An EXC2BAR value of one means that the left operand has been shifted, so
that the correct exponent for a sum or difference is the exponent of the right
operand. An AZERO value of zero means that the left operand fraction was
zero; a BZERO value of zero means that the right operand fraction was zero.
Control signals from the control unit determine the control signal for the
exponent selection process according to the truth table in Table 4.2.5.1.8-2.
Input Signals
EXSEL
Output Selection Signal
EX3TO1(1,2)
Exponent
Selected
X
1
01
10
11
exponent
adder
value
left
operand
exponent
right
operand
exponent
Table 4.2.5.1.8-2 Exponent Selection Control
The SN74S181 arithmetic-logic units are used to either add or subtract one
from the selected exponent. The values of CORCARRY and CORRFUNC(1,4)
necessary to accomplish this are given in Table 4.2.5.1.8-3 which is based
on the operating details of the SN74S181 (Texas Instruments Incorporated, 1973,
p. 383).
Inputs                            SN74S181 Output
CORRFUNC(1,4)    CORCARRY
0000             0                exponent + 1
1111             1                exponent - 1
Table 4.2.5.1.8-3 Control of Exponent Correction Add
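The effect of the two control settings can be sketched in Python. This is a behavioral model only, not a gate-level SN74S181: it assumes that a single control value drives CORCARRY and every CORRFUNC bit (so that CORCARRY = 0 accompanies CORRFUNC = 0000), and the name `correct_exponent` is hypothetical.

```python
def correct_exponent(exponent: int, control: int) -> int:
    """Model the exponent correction step on an eight bit exponent value.

    control is assumed to drive CORCARRY and all four CORRFUNC bits:
      control = 0 -> CORRFUNC = 0000, CORCARRY = 0 -> exponent + 1
      control = 1 -> CORRFUNC = 1111, CORCARRY = 1 -> exponent - 1
    """
    if control not in (0, 1):
        raise ValueError("control must be 0 or 1")
    delta = 1 if control == 0 else -1
    # The result is kept to eight bits, as in the eight bit sum EXP(1,8).
    return (exponent + delta) & 0xFF
```

Incrementing the largest biased exponent, 01111111, gives 10000000, and decrementing 00000000 gives 11111111; in both results the high order bit is a one.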
The control logic shown in Figure 4.2.5.1.8-3 supplies the CORCARRY and
CORRFUNC(1,4) signals. The signal from the division control ROM is explained
in section 4.2.5.2.5.
Figure 4.2.5.1.8-3 The CORCARRY and CORRFUNC(1,4) Bits for
Exponent Correction Adder Control
The final stage of the exponent correction adder
performs a selection function for the result exponent similar to that
performed for the result fraction by the fraction selection logic described in
section 4.2.5.1.7. The selection is performed by four SN74S153 four-to-one
selectors according to the logic shown in Figure 4.2.5.1.8-4 and the truth
table given in Table 4.2.5.1.8-4. The four final exponent values which can
be selected are:
1. The constant 46₁₆, which is the correct biased exponent
value for the status register value.
2. The exponent of the value received from the routing unit.
3. The exponent selected by the input selection logic of the
exponent correction adder.
4. The above exponent modified by the SN74S181's of the
exponent correction adder. This last choice is governed
by the OVFLSEL bit whose derivation is explained in detail
in section 4.2.5.2.3.
Inputs                                        Selection   Exponent
control 1   control 2   control 3   OVFLSEL   Signal      Selected
1 X X                                         00          46₁₆
X                                             01          routing exponent
1 1 1                                         10          selected exponent
1 1                                           11          modified exponent
Table 4.2.5.1.8-4 Final Exponent Selection Control Signal
Figure 4.2.5.1.8-4 Control Signal Logic for Final Exponent Selection
4.2.5.1.9 The Mode Logic
The mode logic is shown in Figure 4.2.5.1.9-1. It includes the mode
flip-flop register (the SN74S175) and an eight bit status register (the
AM9334). The contents of the mode register provide the most important local
control function in the processor. When the mode bit is zero, modification
of operand registers and condition flip-flops (see section 4.2.5.1.12) is not
permitted. The status register can be used to store mode register states.
Its use is illustrated in sections 6.4 and 6.5.
The mode logic permits combining the current mode state with any
one of fifteen bit values local to the processor or with one bit from the
control unit, MODEIN. The selected bit can be combined with the mode bit
using any of the sixteen possible Boolean functions of two variables; the
SN74S181 can compute all of these Boolean functions. The resulting bit
value can be stored in the mode flip-flop and/or any one of the eight bit
positions of the status register. The status bits, STATUS(1,8), the mode
flip-flop state, and the condition flip-flop states can all be saved in or
restored from a processor register (see section 4.2.5.1.7).
The fifteen possible local operand bits for Boolean combination
with the mode bit include:
1. the eight processor status register bits, STATUS(1)
through STATUS(8),
2. the five condition flip-flop contents, C, Z, SIGN, O, and U,
3. two combinations of condition flip-flop contents, namely
a. ZBAR NAND SIGNBAR
b. OBAR NAND UBAR
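The combination of the mode bit with a selected operand bit can be illustrated with a short Python sketch. The four bit function code used here is a plain truth table chosen for the illustration; it is an assumption and is not the SN74S181 function code, and the names `combine`, `AND`, `OR`, `XOR` and `COPY_OPERAND` are hypothetical.

```python
def combine(mode: int, operand: int, func: int) -> int:
    """Apply one of the sixteen Boolean functions of two one-bit variables.

    func is a four bit truth table: bit (2*mode + operand) of func is the
    result. This encoding is illustrative, not the SN74S181 encoding.
    """
    return (func >> (2 * mode + operand)) & 1


# A few of the sixteen truth tables, written out for the encoding above.
AND = 0b1000           # one only when mode = 1 and operand = 1
OR = 0b1110            # zero only when mode = 0 and operand = 0
XOR = 0b0110           # one when the two bits differ
COPY_OPERAND = 0b1010  # ignore the mode bit, load the operand bit

# Example: a processor stays enabled only if it is enabled now and its
# selected condition bit is a one.
new_mode = combine(1, 0, AND)
```

Enumerating `func` from 0 through 15 produces sixteen distinct truth tables, which is the full set of Boolean functions of two variables.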
Figure 4.2.5.1.9-1 The Mode Logic
The bits of parts (2) and (3) above permit testing for any of the six possible
relations between two numerical values as shown in Table 4.2.5.1.9-1.
Relation                Bit                   Comments
Equal                   Z                     A result fraction was zero
Not equal               ZBAR                  A result fraction was not zero
Greater than or equal   SIGNBAR               A result sign was positive
Less than or equal      SIGNBAR NAND ZBAR     A result was negative or zero
                        = SIGN OR Z
Greater than            SIGNBAR AND ZBAR      Complement of the above by appropriate
                                              SN74S181 Boolean function selection
Less than               SIGN                  A result sign was negative
Table 4.2.5.1.9-1 Testing for Any Possible Relation
Between Arithmetic Values
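The six tests can be sketched in Python. The sketch assumes that the condition bits were set by a subtraction of the second value from the first, which the table does not state explicitly, and the name `relation_bits` is hypothetical.

```python
def relation_bits(a: float, b: float) -> dict:
    """Derive the six relation tests from the Z and SIGN bits of a - b."""
    diff = a - b
    z = int(diff == 0)
    sign = int(diff < 0)           # a zero result always has a positive sign
    zbar, signbar = 1 - z, 1 - sign
    nand = 1 - (signbar & zbar)    # SIGNBAR NAND ZBAR, equal to SIGN OR Z
    return {
        "equal": z,
        "not_equal": zbar,
        "greater_or_equal": signbar,
        "less_or_equal": nand,
        "greater": signbar & zbar,
        "less": sign,
    }
```

Because the sign of a zero result is forced positive, SIGNBAR alone is sufficient for the greater than or equal test.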
The SN74S299 is an eight bit parallel-in parallel-out shift register
which can operate at rates up to 50 MHz. It can shift left and right and has
a serial bit output. A subset of its facilities is used. Signal FUNC299 is
used to select either the parallel load or shift function. The SN74S299 receives
eight bits from the processor registers for restoration to the AM9334 status
register.
The mode logic can accomplish its operations in significantly less
time than can the full processor. If it is desired, this fact can be used to
advantage by permitting the control unit to use several different inter-clock
pulse intervals for array control. Mode operations, and in particular
the serial shift of the eight bits from the SN74S299 to the AM9334, are among
the best candidates for this approach.
The status bits, STATUS(1,8), can be saved in a processor register
with an assigned exponent value of 46₁₆ (a biased exponent of plus six) by
appropriate use of the fraction selector, section 4.2.5.1.7, and the final
exponent selection part of the exponent correction adder, section 4.2.5.1.8.
The fraction selection logic complements its input; therefore, an inverting
two-to-one selector (the SN74S158) is used to reinvert the data.
The AM9334 is an eight bit latch which accepts one input bit and a
three bit latch address, ADDR334(1,3). It stores the input bit in the
addressed latch when an input enable signal goes to a logic zero. (See
Advanced Micro Devices Incorporated, 1974, pp. 2-149 through 2-154.)
The SN74S150 is an inverting sixteen-to-one selector, controlled by
SEL150(1,4). It provides one input to an SN74S181 arithmetic-logic unit
which operates in logic mode. The other input to the SN74S181 is the current
mode value. Any of the sixteen possible Boolean combinations of two variables
can be specified by MODEFUNC(1,4). (See Texas Instruments Incorporated, 1973,
pp. 382-391.)
The SN74S175 is a quadruple flip-flop package which has both MODE
and MODEBAR outputs available.
4.2.5.1.10 The Operand Registers
Although memory values have only thirty-two bits, intermediate
results within the processor have forty bits. The extra eight bits extend
the fraction to thirty-two bits within the processor. Each processor has
sixteen operand registers. They are implemented by using SN74S172 register
files. The SN74S172 stores sixteen bits organized as eight two bit words.
Figure 4.2.5.1.10-1 illustrates how two SN74S172 packages are used in this
design to form a sixteen word file of two bit words. Twenty such combinations,
or a total of forty SN74S172 packages, are required to implement the
sixteen forty bit registers of the processor. The top SN74S172 package of
each pair is used to store words zero through seven, and the bottom package
stores words eight through fifteen.
The SN74S172 permits two data words to be read and two data words
to be written simultaneously. However, only three addresses are permitted.
One address specifies a word to be read, another specifies a word to be
written, and the third specifies a word to be read and/or written. The outputs
are tri-state; two enabling signals control the two read ports. Two more
enabling signals control the two write ports. When a given enabling signal
is a logic zero, the port to which it corresponds is permitted to function.
A four bit address is required to select one of sixteen words.
Three four bit addresses and four control signals are used to control the
registers. The three low order bits of each address are sent to the proper
port of each of the forty SN74S172 packages. The high order bits of AADDRESS
and BADDRESS are combined with two of the control signals to form the
selection inputs of a pair of SN74S153 four-to-one selectors for each enable signal.
One enable signal of each pair controls registers zero through seven, the
other registers eight through fifteen. The truth tables for the read enable
signals are given in Table 4.2.5.1.10-1.
Figure 4.2.5.1.10-1 Sixteen Two Bit Words Implemented with SN74S172
Register Files
SELECTION BITS                 ENABLE SIGNALS
High Order     Control Bit     Registers        Registers
Address Bit                    Zero through     Eight through
                               Seven            Fifteen
1
1
i
l
l
O
H
O
H
1 1
i
1
i
o
1 1
o ! 1 !
Table 4.2.5.1.10-1 Truth Table for the Read Enable Signals
The high order bits of AADDRESS and CADDRESS are combined with the
other two control signals to yield the selection signals for two more pairs
of SN74S153 four-to-one selectors. These two pairs of selectors supply the
A and C write enable signals. The truth table for these selectors is also
given by Table 4.2.5.1.10-1, except that the zero logic input is supplied by
the MODEBAR output of the MODE flip-flop in each processor. This prohibits
any writing into registers of disabled processors. A clock pulse is required
to clock input signals into the SN74S172 through an enabled write port.
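The bank selection and the mode protection of writes can be sketched as a small behavioral model in Python. The class name `OperandRegisters` and its methods are hypothetical, and the model ignores the separate read and write ports and the enable polarities of the SN74S172.

```python
class OperandRegisters:
    """Sixteen forty bit registers split across two eight-word banks."""

    WORD_MASK = (1 << 40) - 1

    def __init__(self) -> None:
        # banks[0] holds registers 0-7, banks[1] holds registers 8-15.
        self.banks = [[0] * 8, [0] * 8]

    def read(self, address: int) -> int:
        # The high order address bit picks the bank; the three low order
        # bits pick the word within the bank.
        return self.banks[address >> 3][address & 0b111]

    def write(self, address: int, value: int, mode: int) -> None:
        # A disabled processor (mode == 0) must not change its registers.
        if mode:
            self.banks[address >> 3][address & 0b111] = value & self.WORD_MASK
```

A write attempted while the mode bit is zero leaves the addressed register unchanged, which is the behavior the MODEBAR gating is described as providing.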
4.2.5.1.11 The Index Adder
We saw in section 3 that address indexing within the
processors is an important capability in an array processor. Figure 4.2.5.1.11-1
shows the logic of the index adder which computes a sixteen bit effective
address, EADDR(1,16), within each processor. The adder is implemented with
SN74S181 arithmetic-logic units augmented with an SN74S182 look-ahead carry
generator. It is controlled by a function input, IXFUNC(1,4), and a carry
input, IXCARRY, from the control unit. The address from the control unit,
CUADDR(1,16), is combined with A(9,16) by the adder. The "A" bits, which come
from the operand registers, are the low order sixteen bits of a twenty-four
bit memory-length fraction. A twos-complement integer can be produced for use
in indexing from a floating point value by performing an unnormalized addition
with the value with fraction 80000000₁₆ and biased exponent 46₁₆. Two examples
of this operation are given in Table 4.2.5.1.11-1.
Figure 4.2.5.1.11-1 The Index Adder Logic
Initial Operands     Aligned Operands     Sum
 46 80000000          46 80000000
 41 10000000          46 00000100          46 80000100
 46 80000000          46 80000000
-41 10000000         -46 00000100          46 7FFFFF00
Table 4.2.5.1.11-1 Two Examples of Processor Index Value
Computation
The hexadecimal digits which are underlined in the Sum column of
Table 4.2.5.1.11-1 (the third through sixth digits, 0001 and FFFF) are the
part of the "A" operand which is one of the inputs
to the index adder.
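The two examples can be reproduced with a short Python sketch. It assumes the value being converted is already an integer that fits in sixteen bits, and that the index adder is set to perform a plain sixteen bit addition; the names `index_bits` and `effective_address` are hypothetical.

```python
BIAS_FRACTION = 0x80000000  # fraction of the constant with biased exponent 46 (hex)


def index_bits(n: int) -> int:
    """Return A(9,16) after the unnormalized addition of integer n."""
    if not -0x8000 <= n <= 0x7FFF:
        raise ValueError("n must fit in sixteen bits")
    # Aligned to exponent 46 (hex), the integer digits end at bit 24 of the
    # thirty-two bit fraction, so the magnitude sits eight bits up.
    aligned = abs(n) << 8
    total = BIAS_FRACTION + aligned if n >= 0 else BIAS_FRACTION - aligned
    # A(9,16): the low order sixteen bits of the twenty-four bit fraction.
    return (total >> 8) & 0xFFFF


def effective_address(cuaddr: int, n: int) -> int:
    """Sixteen bit sum of the control unit address and the index bits."""
    return (cuaddr + index_bits(n)) & 0xFFFF
```

For n = 1 the index bits are 0001 and for n = -1 they are FFFF, matching the Sum column of the table; added to a control unit address of 1000 (hexadecimal), the latter gives 0FFF.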
Indexing of centrally supplied addresses might also be performed by
the main adder of the processor. To accomplish this, the control unit supplied
address value must be gated to the adder. The least costly way to provide this
gating is to replace four of the quadruple two-to-one selectors in the right
operand selection logic of Figure 4.2-1 with eight dual four-to-one selectors.
This results in a net package count increase of four packages. The logic
described here requires four packages if ripple carry operation is used with
the SN74S181 arithmetic-logic units, and five packages, as shown in
Figure 4.2.5.1.11-1, if carry look-ahead operation is used. Even the ripple
carry scheme is faster than requiring the operands and the result to pass
through the alignment shifters and fraction selector which use of the main
adder requires.
4.2.5.1.12 The Condition Flip-flops
This set of sections describes the five flip-flops which hold
information about the results of operations in the processor. The state of each of
these flip-flops is protected from being changed when the processor is disabled
by having its mode value equal to zero. This control is provided by using the
lower of the two CLOCK gating methods of Figure 4.2.5.1.12-1. These gates are
not shown in the subsequent figures which illustrate the individual flip-flops.
A control signal unique to each and the MODE value are used to produce a mode
controlled clock pulse for each of the condition flip-flops.
All of the condition flip-flops are implemented with one-half of an
SN74S74 dual flip-flop package. Both the true and complemented states are
supplied for use by this package.
Figure 4.2.5.1.12-1 CLOCK Selection Logic
4.2.5.1.12.1 The Carry Flip-flop
Figure 4.2.5.1.12.1-1 shows the carry flip-flop and its associated
control logic. Its state can be stored in a processor register (see section
4.2.5.1.7), and can be restored from a processor register by selecting the
path which includes B(12). The carry out of the adder, ACOUT, can be used to
set the state of the carry flip-flop, or it can be ORed with the previous
state by using the appropriate control signal values.
Figure 4.2.5.1.12.1-1 The Carry Flip-flop Logic
4.2.5.1.12.2 The Zero Flip-flop
Figure 4.2.5.1.12.2-1 shows the zero flip-flop and its associated
control logic. Its state can be stored in a processor register (see
section 4.2.5.1.7), and can be restored from a processor register by
selecting the path which includes B(13). The primary input to the zero
flip-flop is the output of a zero detect block (see section 4.2.5.1.1) which
operates on the output of the fraction selection logic (of section 4.2.5.1.7).
Previous states can be ORed or ANDed with a current state by using the
appropriate signal values.
4.2.5.1.12.3 The Sign Flip-flop
Figure 4.2.5.1.12.3-1 shows the sign flip-flop and its associated control
logic. Its state can be stored in a processor register (see section 4.2.5.1.7)
and can be restored from a processor register by using the proper selection
signals for the SN74S151 eight-to-one selector and the SN74S153 four-to-one
selector shown in the figure. The control logic permits the sign flip-flop
to be set to any of the values listed in Table 4.2.5.1.12.3-1.
SIGNAL                          MEANING
B(14)                           A state presumably previously stored in a processor
                                register
(illegible)                     The exclusive OR of the operand signs
S670(4) wire-OR AFUNC(4)        The sign of a sum or difference
EXPA(1)                         The sign of the left operand
EXPB(1)                         The sign of the right operand
RTESIGN                         The sign of an operand from the routing unit
0                               A forced positive sign; absolute value
1                               A forced negative sign; minus the absolute value
Table 4.2.5.1.12.3-1 Possible Signs for a Result
Figure 4.2.5.1.12.2-1 The Zero Flip-flop Logic
Figure 4.2.5.1.12.3-1 The Sign Logic and the Sign Flip-flop
The complement of any of the first six signs shown in Table 4.2.5.1.12.3-1
can also be selected.
The NOR gate between the SN74S153 and the flip-flop uses signal ZFFINBAR of
the zero flip-flop logic (see section 4.2.5.1.12.2) to ensure that the sign
of a zero result is always a logic zero, or a positive sign. The NOR gate
is used together with appropriate selection by the SN74S153 since no Schottky
AND gate is available.
4.2.5.1.12.4 The Overflow Flip-flop
Figure 4.2.5.1.12.4-1 shows the overflow flip-flop and its associated
control logic. Its state can be stored in a processor register (see section
4.2.5.1.7), and can be restored from a processor register by selecting the
path which includes B(15).
In this design, an overflow condition exists when:
1. an exponent value which exceeds sixty-three is computed. This can occur
in the exponent adder during the computation of the result exponent for
multiplication or division; the signal EXO, described by the truth table
in Table 4.2.5.1.12.4-1, is a logic one for this case. Fraction overflow
necessitates increasing the exponent by one in the exponent correction
adder; signals CORROVFL and EXP(1) cover this case.
2. a division by a zero fraction is attempted. The AZERO signal from the
zero detect logic for the left operand fraction covers this case.
3. an attempt is made to integerize a floating point value whose integer
part requires more than six hexadecimal digits. Signal INTRUNC, derived
by the logic of Figure 4.2.5.1.12.4-2, covers this case.
Figure 4.2.5.1.12.4-1 The Overflow Flip-flop Logic
A biased exponent with value V is represented by an exponent field
value of 64+V. The sum of two exponents is:
   64 + V1
 + 64 + V2
 ---------
  128 + V1 + V2 = 128 + V = S
An overflow occurs when 64 ≤ V ≤ 63 + 63 = 126, or when
192 ≤ S ≤ 254. (1)
A correct exponent results when -64 ≤ V ≤ 63, or when
64 ≤ S ≤ 191. (2)
Expressed in binary form, the above conditions are:
(1) 11xxxxxx
(2) 01xxxxxx (-64) or 10xxxxxx (63).
The difference of two exponents is:
   (64 + V1)
 - (64 + V2)
 -----------
   V1 - V2 = V
An overflow occurs when 64 ≤ V ≤ 63 - (-64) = 127. (3)
A correct exponent results when -64 ≤ V ≤ 63. (4)
Expressed in binary form, the above conditions are:
(3) 01xxxxxx
(4) 11xxxxxx (-64) or 00xxxxxx (63).
Conditions (1) through (4) can be implemented using an SN74S151
eight-to-one selector with the two high order bits of the result exponent and
the exclusive OR of ABFUNC(2) and ABFUNC(3) as the three bit selection code.
Table 4.2.5.1.12.4-1 gives the truth table for this function.
ABFUNC(2) XOR ABFUNC(3)
implies subtraction
EXC1(1) EXC1(2)
1
1
1
1
1
1
1
1
1
1
1
1
EXO
1
x
x
1
Table 4.2.5.1.12.4-1 The Truth Table for Exponent Overflow
Signal EXO
For both exponent addition and subtraction, the straightforward
arithmetic steps uniformly result in a bias bit which is incorrect. A correct
biased result is produced when the bit in the bias position of the result
is complemented after the arithmetic result has been computed.
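The bias bit rule and the overflow condition for a sum can be checked with a short Python sketch. It assumes seven bit exponent fields with a bias of 64 and an eight bit raw sum; the name `add_biased` is hypothetical.

```python
def add_biased(e1: int, e2: int):
    """Add two seven bit biased exponents.

    Returns (biased result, overflow flag). The result is meaningful
    only when no overflow or underflow has occurred.
    """
    s = e1 + e2                    # S = 128 + V, an eight bit sum
    overflow = (s >> 6) == 0b11    # condition (1): 11xxxxxx
    result = (s & 0x7F) ^ 0x40     # keep seven bits, complement the bias bit
    return result, overflow


# Exhaustive check over every pair of true exponents from -64 to 63.
for v1 in range(-64, 64):
    for v2 in range(-64, 64):
        v = v1 + v2
        result, overflow = add_biased(64 + v1, 64 + v2)
        assert overflow == (v >= 64)
        if -64 <= v <= 63:
            assert result == 64 + v
```

The loop confirms that, whenever the true sum lies between -64 and 63, complementing the bias bit of the raw seven bit sum gives exactly 64 + V.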
During exponent correction, either one or zero is added to the
exponent. The only way overflow can occur is that one is added to the biased
exponent representation for an exponent of 63:
(64 + 63) + 1 = 128.
This has the binary form 10000000; in no other case does the result exponent
have a high order one. Hence, the correct signal for overflow detection
during exponent correction is EXP(1), the high order bit of the eight bit sum.
4.2.5.1.12.5 The Underflow Flip-flop
Figure 4.2.5.1.12.5-1 shows the underflow flip-flop and its associated
control logic. Its state can be stored in a processor register (see section
4.2.5.1.7), and can be restored from a processor register by selecting the
path which includes B(16).
Figure 4.2.5.1.12.5-1 The Underflow Flip-flop Logic
In this design, operand underflow occurs only when a result
exponent which is less than -64 is computed. This can occur:
1. in the exponent adder during the computation of the result exponent
for a multiplication or division; the signal EXU, described by the truth
table in Table 4.2.5.1.12.5-1, is a logic one for this case.
2. when the value one is subtracted from an exponent value of -64 in the
exponent correction adder. This occurs only during some division steps
(see section 4.2.5.2.5). For this case, the initial biased exponent
value is 00000000, and the result, 11111111, is the only case for which
the high order result exponent bit, EXP(1), is a logic one.
A biased exponent with the value V is represented by an exponent
field value of 64+V. The sum of two such exponents is:
   64 + V1
 + 64 + V2
 ---------
  128 + V1 + V2 = 128 + V = S
An underflow occurs when -128 ≤ V ≤ -65, or when
0 ≤ S ≤ 63. (1)
A correct exponent results when -64 ≤ V ≤ 63, or when
64 ≤ S ≤ 191. (2)
Expressed in binary form, the above conditions are:
(1) 00xxxxxx
(2) 01xxxxxx or 10xxxxxx.
The difference of two exponents is:
   (64 + V1)
 - (64 + V2)
 -----------
   V1 - V2 = V
An underflow occurs when V ≤ -65. (3)
A correct exponent results when -64 ≤ V ≤ 63. (4)
Expressed in binary form, the above conditions are:
(3) 10xxxxxx
(4) 11xxxxxx (-64) or 00xxxxxx (63).
Conditions (1) through (4) can be implemented using an SN74S151 eight-to-one
selector with the two high order bits of the result exponent and the exclusive
OR of ABFUNC(2) and ABFUNC(3) (see section 4.2.5.1.3) as the three bit
selection code. Table 4.2.5.1.12.5-1 gives the truth table for this function.
ABFUNC(2) XOR ABFUNC(3)
implies subtraction
EXC1 EXC2
T
1
EXU
1 X
1 1
1 1
1 1
1 1
1 1
1 1 1 X
Table 4.2.5.1.12.5-1 The Truth Table for the Exponent Underflow Bit
For both exponent addition and subtraction, the straightforward
arithmetic steps uniformly result in a bias bit which is incorrect. A cor-
rect biased result is produced when the bit in the bias position is complemented
after the arithmetic result is computed.
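The eight-to-one selection can be sketched in Python as a lookup keyed by the three bit selection code. The table below is derived from conditions (1) and (3) only; entries that a hardware truth table could treat as don't-care are written as 0 here, which is an assumption, and the names `EXU_TABLE` and `exponent_underflow` are hypothetical.

```python
# Key: (subtract, high order bit, next bit) of the eight bit exponent result.
EXU_TABLE = {
    (0, 0, 0): 1,  # sum, condition (1): 00xxxxxx
    (0, 0, 1): 0,
    (0, 1, 0): 0,
    (0, 1, 1): 0,  # a sum of the form 11xxxxxx is an overflow instead
    (1, 0, 0): 0,
    (1, 0, 1): 0,  # a difference of the form 01xxxxxx is an overflow instead
    (1, 1, 0): 1,  # difference, condition (3): 10xxxxxx
    (1, 1, 1): 0,
}


def exponent_underflow(e1: int, e2: int, subtract: bool) -> int:
    """Return 1 when combining two biased exponents underflows."""
    raw = (e1 - e2 if subtract else e1 + e2) & 0xFF
    return EXU_TABLE[(int(subtract), raw >> 7, (raw >> 6) & 1)]


# Exhaustive check against the arithmetic definition of underflow.
for v1 in range(-64, 64):
    for v2 in range(-64, 64):
        assert exponent_underflow(64 + v1, 64 + v2, False) == int(v1 + v2 <= -65)
        assert exponent_underflow(64 + v1, 64 + v2, True) == int(v1 - v2 <= -65)
```

The lookup plays the role of the selector's eight data inputs, each wired to a logic constant.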
During exponent correction, either one or zero is added to the
exponent. The only way overflow can occur is for one to be added to the
biased exponent representation for an exponent of 63:
(64 + 63) + 1 = 128.
This has the binary form 10000000; in no other case does the result exponent
have a high order one. Hence, the correct signal for overflow detection
during exponent correction is EXP(1), the high order bit of the eight bit sum.
4.2.5.2 Processor Function
The previous group of sections described several logic blocks in
their own right without too much regard for their functions in support of
processor operations. This set of sections describes how the logic blocks
are integrated together to perform the high level operations. The details of
the control signals and gating are given in these sections.
4.2.5.2.1 Normalization
A normalized floating point number in this design has a non-zero
hexadecimal (four bit) digit as the leftmost digit of its fraction, unless
the entire fraction is zero. The normalization process accepts an arbitrary
floating point number and produces a normalized number with the same arithmetic
value. A floating point zero is unchanged; a number whose fraction has a
non-zero leftmost hexadecimal digit is unchanged. The fractions of all other
floating point numbers are normalized by a left shift which makes the leftmost
fraction digit non-zero and introduces zero digits on the right for the
zero digits shifted off the left. The exponent of the numbers so adjusted
is reduced by one for each zero digit shifted off.
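The normalization rule can be sketched in Python. The sketch holds the fraction as an integer of eight hexadecimal digits and treats the exponent as a plain integer, ignoring exponent underflow; the name `normalize` and the `digits` parameter are hypothetical.

```python
def normalize(fraction: int, exponent: int, digits: int = 8):
    """Normalize a hexadecimal fraction held as an integer of `digits` digits."""
    if fraction == 0:
        return fraction, exponent           # a floating point zero is unchanged
    top = 4 * (digits - 1)                  # bit offset of the leftmost digit
    mask = (1 << (4 * digits)) - 1
    shift = 0
    while (fraction >> top) & 0xF == 0:     # leftmost digit is zero
        fraction = (fraction << 4) & mask   # shift left one digit, zero fill
        shift += 1
    # The exponent is reduced by one for each zero digit shifted off.
    return fraction, exponent - shift
```

For example, the fraction 00012300 with exponent 46 (hexadecimal) becomes 12300000 with exponent 43, a shift of three digits.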
Figure 4.2.5.2.1-1 shows the control logic which computes the shift
amount for the normalize shift logic. The signal BTEST(1,8) comes from the
SN74S260 gates of the zero detect logic for the right operand (see Figure 4.2-1
and Figure 4.2.5.1.1-1). BTEST(i) is a logic one if digit "i" of the left
operand fraction is zero, numbering the digits from left to right. The
SN74148 eight-line-to-three-line priority encoder accepts an eight bit input
signal and produces a three bit output signal which is a count of the number
of high order ones which occur in the input signal. The value seven is
returned for input signals of all ones, which is the case for numbers with zero
fractions.
Figure 4.2.5.2.1-1 The Leading Zero Detection Logic
Figure 4.2.5.2.1-2
During ordinary normalization, the output of the SN74148 is the left
shift amount and also the number that must be subtracted from the exponent.
It is selected by appropriate control by the SN74S157 two-to-one selector.
NSHIFT(2,3) is sent to the normalize shift logic, and NSHIFT(1,4) goes to the
selection logic for the exponent adder shown in Figure 4.2.5.2.1-2. This
logic selects the "A" exponent for the exponent adder. Normally, it selects
the exponent of "A" from the operand registers. For normalization, the
operand (0100, NSHIFT(1,4)) is selected. Control signals enable the path
for ZFFINBAR, the output of the zero detect logic for the result fraction, to
the strobe input of the SN74S157 of Figure 4.2.5.2.1-2. When the fraction in
question is zero, the output of the SN74S64 is one, so that the SN74S157
selector is disabled and supplies zeros rather than NSHIFT(1,4).
Although a shift of seven places is the largest that occurs during
normalization, there are cases during double precision addition/subtraction
when a value of up to twelve must be subtracted from the exponent. For these
cases, a four bit NSHIFT value is provided. See section 4.2.5.2.7 for details.
The CSHIFT(1,4) signal is supplied by the control unit during
multiplication and division by a power of two operations. See section
4.2.5.2.10 for the details of this operation.
14.2.5.2.2 Rounding
The fraction size of memory words and multiplier operands is
twenty-four bits, and that of processor words is thirty-two bits. A rounding
operation is included in the design to permit rounding a thirty-two bit
processor fraction to a twenty-four bit memory and multiplier length fraction.
The rounding is accomplished by adding one in bit position twenty-four of the
fraction when position twenty-five is a one. The fraction passes through
the logic as the right operand. Bit twenty-five of that fraction is selected
by the left operand selector as bit twenty-four of a fraction that is zero
in every other bit position (see section 4.2.5.1.5). The other bit positions
are forced to zeros by disabling the left alignment shift network. The
exponent of the result is that of the right operand, selected by control
signals to the exponent selection part of the exponent correction adder (see
section 4.2.5.1.8). The two fractions are added by the adder under control
unit control, using CUAFUNC(1,3) for function specification (see section 4.2.5.1.6).
Fraction overflow and the corresponding exponent adjustment by the exponent
correction adder can occur. The sign of the result is the sign of the right
operand.
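The rounding rule above can be illustrated with a short Python sketch. It assumes the fraction is held as a thirty-two bit integer with bit one as the most significant bit, that a fraction overflow is corrected by a one digit right shift and an exponent increment, and that the low order eight bits are dropped afterwards; the function name and the final masking step are illustrative assumptions.

```python
from typing import Tuple

def round_fraction(fraction32: int, exponent: int) -> Tuple[int, int]:
    """Round a 32-bit fraction to 24 bits; bit 1 is the most significant bit."""
    bit25 = (fraction32 >> 7) & 1        # bit 25 counted from the left
    total = fraction32 + (bit25 << 8)    # add one in bit position 24
    if total >> 32:                      # fraction overflow out of bit 1
        total >>= 4                      # shift right one hexadecimal digit
        exponent += 1                    # and correct the exponent
    return total & 0xFFFFFF00, exponent  # assumption: discard bits 25 to 32
```

The overflow case arises only when the high order twenty-four bits are all ones and bit twenty-five is a one.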
4.2.5.2.3 Floating Point Addition
A floating point value in this design is represented by a sign bit,
a non-negative proper fraction and an integer power of sixteen. The fraction
parts cannot be correctly added until they are adjusted for the difference in
their exponents. In this design, this adjustment is made by shifting the
fraction whose exponent is smaller right by the number of digit positions by
which the exponents differ. The process, described in terms of Figure 4.2-1,
proceeds as follows:
The exponent difference is computed by the exponent adder. The
difference, together with a pair of one bit signals which each indicate
whether one of the operand fractions is zero, is used by the pre-align control
logic to specify which of the operands is to be shifted right. At least one
of the alignment shift logic blocks performs a shift of zero places during
each floating point addition. The other alignment shift logic is disabled
when the shift amount exceeds seven. The pre-align control logic also selects
the exponent of the result.
The correctly aligned fractions proceed through the operand selectors,
adder, and fraction selector to the operand registers. The result of this
processing cycle is an un-normalized floating point sum or difference with a
correct exponent. If a normalized result is sought, another cycle is used.
The fraction passes through the leading zero detection logic of Figure
4.2.5.2.1-1, which determines the left shift amount required for normalization.
This shift amount is used by normalization shift logic to perform the fraction
shift, and by the exponent adder to compute the correct exponent for the
normalized result.
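The two cycles described above (pre-alignment and addition, then normalization) can be sketched for magnitudes only. The sketch assumes fractions are eight digit hexadecimal integers scaled by 16**8 and ignores signs; the function names and the treatment of a zero fraction in `normalize` are illustrative assumptions.

```python
DIGITS = 8                           # eight hexadecimal fraction digits
MASK = (1 << (4 * DIGITS)) - 1

def add_magnitudes(fa, ea, fb, eb):
    """Add f_a * 16**e_a and f_b * 16**e_b; returns an un-normalized sum."""
    if fa == 0:                      # a zero operand yields the other operand
        return fb, eb
    if fb == 0:
        return fa, ea
    if ea < eb:                      # make (fa, ea) the larger exponent
        fa, ea, fb, eb = fb, eb, fa, ea
    diff = ea - eb
    fb = fb >> (4 * diff) if diff < DIGITS else 0   # shifts above seven give zero
    total, exp = fa + fb, ea
    if total > MASK:                 # fraction overflow: shift right one digit
        total >>= 4
        exp += 1
    return total, exp

def normalize(f, e):
    """Shift left until the leading hexadecimal digit is non-zero."""
    if f == 0:
        return 0, e                  # assumption: exponent left unchanged
    while (f >> (4 * (DIGITS - 1))) == 0:
        f = (f << 4) & MASK
        e -= 1
    return f, e
```

Only one of the two fractions is ever shifted during alignment, which matches the statement that at least one alignment shift block performs a shift of zero places.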
The addition process is complicated by the fact that sign-magnitude
representation is used for floating point values in this design. The actual
operation which the adder must perform depends not only on the instruction
being executed, but also on the signs and the relative magnitudes of the
operands being processed. If one of the operands is zero, the result is the
other operand. If two operands with equal exponents are to be added, the
actual operation performed by the adder depends on their signs. When the
signs are the same, the adder must add the two magnitudes; the result sign is
that shared by the two operands. However, when the signs differ, the smaller
magnitude must be subtracted from the larger, and the sign of the result is
that of the larger operand. The SN74S381 arithmetic-logic unit is ideally
suited to these circumstances, because it can perform the A+B, A-B, and B-A
operations (see Table 4.2.5.1.3-1).
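The choice among A+B, A-B and B-A for already aligned magnitudes can be sketched as follows. The encoding of the sign as 0 for non-negative and 1 for negative, the function name, and the choice of the left operand's sign when the magnitudes cancel are assumptions made for illustration.

```python
def signed_add(sign_a, mag_a, sign_b, mag_b):
    """Return (operation, result sign, result magnitude) for A + B."""
    if sign_a == sign_b:
        return "A+B", sign_a, mag_a + mag_b      # equal signs: add magnitudes
    if mag_a >= mag_b:
        return "A-B", sign_a, mag_a - mag_b      # left operand is the larger
    return "B-A", sign_b, mag_b - mag_a          # right operand is the larger
```

A subtraction instruction can be modelled with the same function by inverting `sign_b` before the call.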
When the argument exponents differ, the operand with the larger
exponent is the larger in absolute value without regard to the fraction values
involved. Hence, an exponent comparison is also required to determine what
SN74S381 operation to perform. Table 4.2.5.2.3-1 summarizes the ten input
signals which are required to determine the operation which is performed by
the SN74S381 arithmetic-logic units of the adder. Figure 4.2.5.2.3-1 shows
the logic which implements Table 4.2.5.2.3-1. During floating point addition
and subtraction, the wire OR network of Figures 4.2.5.1.6-2 and 4.2.5.1.6-3
makes AFUNC the same as AFUNC1 by appropriate enabling of the tri-state signals.
The ABEXEQ signal is derived by the logic of Figure 4.2.5.2.3-2. When the
absolute value of the exponent difference is zero, the exponents are equal,
and ABEXEQ is a logic zero.
A fraction overflow can occur only when the function performed by
the SN74S381 arithmetic-logic units of the adder is A+B. The signal OVFLSEL
is implemented by an SN74S151 eight-to-one selector which uses AFUNC(1,3),
the SN74S381 function specification, as its selection signal. The input to
the SN74S151 is a logic one in every position except that which corresponds
to AFUNC(1,3)=011; for the latter case, the selector input is ACOUT, the high
order carry out of the adder. A logic zero value for OVFLSEL thus indicates
a fraction overflow. The OVFLSEL signal is used by both the fraction
selection and the exponent correction logic.
Figure 4.2.5.2.3-1 The Logic which Selects the Adder Function
During Addition and Subtraction
(diagram not reproduced; labelled elements: SIG8205 512 x 8 ROM, SN74S257,
ABEXEQ, AFUNC1(1,3), tri-state enable, result sign)
Figure 4.2.5.2.3-2 The Logic for the ABEXEQ Signal
(diagram not reproduced; labelled elements: ABS(1) through ABS(7), SN74S02,
SN74S20, ABEXEQ)
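The OVFLSEL selection described above amounts to a table lookup on the three function bits. The sketch below is a behavioural model under that reading; the function name is illustrative, and the polarity of the real selector output is not modelled.

```python
def ovflsel(afunc: int, acout: int) -> int:
    """Eight-to-one selection: every input is 1 except input 011, which is ACOUT."""
    inputs = [1] * 8
    inputs[0b011] = acout            # 011 is the A+B function code
    return inputs[afunc & 0b111]
```

A result of zero is therefore possible only when the adder function is A+B and ACOUT is zero.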
Signal    Meaning
ABEXEQ    The two operand exponents have the same value.
AZERO     The left operand (A) fraction is zero.
BZERO     The right operand (B) fraction is zero.
EXC2      The exponent of the right operand exceeds that of
          the left operand.
CUADD     The operation specified is addition.
CUSUB     When CUADD is zero, subtract the right operand
          from the left; that is B-A.
          When CUADD is zero, subtract the left operand
          from the right; that is A-B.
SIGNA     The left operand is greater than or equal to zero.
SIGNB     The right operand is greater than or equal to zero.
AGTR      The unshifted left fraction exceeds the unshifted
          right fraction.
ABEQ      The unshifted fractions are equal.
Table 4.2.5.2.3-1 The Input Signals for the Adder Function Logic
Table 4.2.5.1.3-1, which lists the functions and function codes for
the SN74S381 arithmetic-logic unit of the adder, indicates that the carry into
the adder depends on the function code. The logic of Figure 4.2.5.2.3-3 shows
how the carry into the adder is determined. Since the adder operates with
active low data, a one carry in is required for addition and a zero for
subtraction. The control unit can specify the carry by using control signal CUAC.
When the adder function is determined in the processor by the logic of Figure
4.2.5.2.3-1, the path which uses adder function bits produces the correct
carry in. The carry flip-flop output, C, is used as the carry in to the adder
during double precision operations.
Figure 4.2.5.2.3-3 The Logic for the Carry into the Adder
(diagram not reproduced; labelled elements: CUAC, AFUNC(2), AFUNC(3),
SN74H52, AC, control signals)
Figure 4.2.5.2.3-4 The Alignment Shift Control Logic
(diagram not reproduced; labelled elements: SN74S04, SN74S20, SHIFT < 8,
DLLT8, DRLT8, EBSH, ELAS, ERAS, EXC2, EXC2BAR, SHZERO, AZERO, BZERO,
ASHIFT(1,3), BSHIFT(1,3), select zero, tri-state enable)
The logic which controls alignment shifting during floating point
addition and subtraction is shown in Figure 4.2.5.2.3-4. The signal ELAS is
the enabling signal for the left alignment shift logic, and ERAS is that for
the right alignment shift logic. The signals EASH and EBSH permit control
unit specification of the shift enables without regard to local conditions.
The two signals DLLT8 and DRLT8 come from the double precision control ROM,
and the signal S is derived from the logic of Figure 4.2.5.2.7-3. Bits one
through four of the absolute exponent difference, ABS(1,4), are combined by
an SN74S260 NOR gate to yield a signal which is a logic one when the alignment
shift amount is less than eight. The actual shift amount is either ABS(5,3)
or zero under the control of a pair of shift selection signals which use
AZERO, BZERO and EXC2 of Table 4.2.5.2.3-1 along with a control unit signal
SHZERO. When any of the preceding signals is a logic zero, the shift selection
signal is a logic one, and a zero shift amount is selected.
4.2.5.2.4 Multiplication
Measurements of the current model's execution on the IBM/360 revealed
that approximately one-half of the floating point instructions executed are
multiplications. Therefore, we have designed a high speed fully parallel
multiplier. The details of this work are given in a Masters thesis by Mr.
William Stenzel (1975). Because the amount of hardware necessary for this
multiplier varies as the square of the operand lengths, we chose to implement
a twenty-four by twenty-four bit multiplier. The rounding operation described
in section 4.2.5.2.2 rounds floating point values to this fraction precision.
The integrated circuits used in the multiplier are:
1. the SN74S274 read only memory which accepts an eight bit address and
returns an eight bit result. It is pre-programmed to accept two four
bit digits and return their eight bit product (Texas Instrument
Corporation, 1974; pp. 262-270),
2. the Signetics N8228 read only memory which accepts a ten bit address and
returns a four bit operand. This device, available as Signetics part
number N8228-CB1105, is programmed to add five two bit numbers and
produce a four bit sum,
3. the SN74283 four bit binary full adder, which accepts two four bit inputs
and a carry input, and produces a four bit sum and a carry output, and
4. the SN74S381 arithmetic-logic unit which is used together with SN74S182
look-ahead carry generators to perform a final addition step in the
multiplication process.
Figure 4.2.5.2.4-1 illustrates how to compute the product of two eight bit
values using four SN74S274 read only memories. Each subscripted symbol in
the figure represents a four bit digit. The four eight bit products are
displayed in the familiar trapezoidal form and have also been rearranged in a
triangular form. Four bit adders can be used to sum the partial products to
yield the required product. Figure 4.2.5.2.4-2 shows the triangular
rearrangement for all of the bits in the product of two twenty-four bit
operands. A three stage reduction process results in the required product.
Figure 4.2.5.2.4-1 The Product of Two Eight Bit Values
(diagram not reproduced; it shows the partial products a0b0, a1b0, a0b1 and
a1b1 of the operands a1a0 and b1b0)
Figure 4.2.5.2.4-2 (diagram not reproduced)
The vertical rectangles in the figure represent Signetics 8228-CB1105
read only memories. The five high order bits of the address, pins three through
seven, accept the left column of bits - the high order bits of the five two bit
input operands. The five low order bits of the address, pins one, two and
thirteen through fifteen, accept the right column of bits - the low order bits
of the five two bit input operands. The low order bit of the four bit sum
appears on the output pin twelve, the low order bit of the output word.
The horizontal rectangles represent SN74283 four bit adders.
In the first reduction stage, the eleven rows of partial product
bits are reduced to five rows by using twenty Signetics 8228's and six SN74283's.
In the second stage, ten 8228's and six SN74283's reduce the five rows to two.
Nine SN74S381's and three SN74S182's produce the forty-eight bit product in
the last stage.
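The eight bit product of Figure 4.2.5.2.4-1 can be sketched in Python as four table lookups followed by shifted additions. The `ROM` list below is only a model of a read only memory that maps two four bit digits to their eight bit product; its addressing order and the function names are illustrative assumptions.

```python
# Model of a 256-word table: address = two four-bit digits, data = their product.
ROM = [(addr >> 4) * (addr & 0xF) for addr in range(256)]

def digit_product(x: int, y: int) -> int:
    """Look up the eight-bit product of two four-bit digits."""
    return ROM[(x << 4) | y]

def multiply_8x8(a: int, b: int) -> int:
    """Form a sixteen-bit product from four digit products."""
    a1, a0 = a >> 4, a & 0xF
    b1, b0 = b >> 4, b & 0xF
    return (digit_product(a0, b0)
            + (digit_product(a1, b0) << 4)
            + (digit_product(a0, b1) << 4)
            + (digit_product(a1, b1) << 8))
```

The same decomposition applied to six digit operands gives thirty-six digit products, which is why the hardware cost grows as the square of the operand length.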
4.2.5.2.5 Division
Three different division algorithms were examined as candidates
for use in this design. They are all similar in two respects:
1. Each algorithm uses the multiplier.
2. Each algorithm uses read only memories to store values which it needs.
The first scheme used a quadratic Chebyshev fit to the reciprocal,
stored the coefficients in read only memories, and used the multiplier to
evaluate the quadratic polynomial. The scheme is not workable because the
polynomial coefficients are relatively large and oscillate in sign, so that
a reciprocal accurate to twenty-four bits could not be computed with the
twenty-four bit multiplier.
The second scheme multiplies both numerator and denominator by
cleverly chosen constants (Garcia, 1974). Two multiplications of both
numerator and denominator reduce the denominator to one and the numerator
to the required quotient. The denominator must be normalized so that there
is a one in the high order bit. Call the high order eleven bits of this
normalized denominator "A", and the low order thirteen bits "B". We can
compute a twenty-four bit reciprocal of "A" with six Signetics N8228 read only
memories which accept a ten bit address and report a four bit result. We
can use only ten bits of "A" since the high order bit is known to be a one.
The following sequence of equations illustrates the technique:
$$\frac{N}{D} = \frac{N}{A+B} = \frac{N(1/A)}{(A+B)(1/A)} = \frac{N(1/A)}{1+B/A}
= \frac{N(1/A)\,(1 - B/A + (B/A)^2)}{(1+B/A)\,(1 - B/A + (B/A)^2)}
= \frac{N(1/A)\,(1 - B/A + (B/A)^2)}{1 + (B/A)^3}$$
By construction, B is less than $2^{-11}$, and A is greater than or equal to
one-half. Therefore, B/A is less than $2^{-10}$, so that $(B/A)^3$ is less than
$2^{-30}$, and is therefore negligible in computing a twenty-four bit quotient.
Four multiplications are necessary to compute the quotient using this scheme:
1. N(1/A)
2. B/A from B and 1/A
3. $(B/A)^2$
4. $N(1/A)(1 - B/A + (B/A)^2)$
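The second scheme can be checked numerically with exact rational arithmetic. In the sketch below the reciprocal of "A" is computed exactly rather than taken from a read only memory, so the only error shown is the neglected cubic term; the function name and the integer encoding of the denominator are assumptions made for illustration.

```python
from fractions import Fraction

def divide_multiplicative(n: Fraction, d_bits: int) -> Fraction:
    """Approximate n / D, where D = d_bits / 2**24 and bit 23 of d_bits is set."""
    a = Fraction(d_bits >> 13, 1 << 11)       # high-order eleven bits, A
    b = Fraction(d_bits & 0x1FFF, 1 << 24)    # low-order thirteen bits, B
    recip_a = 1 / a                           # exact here; a table lookup in the text
    t = b * recip_a                           # B/A
    return n * recip_a * (1 - t + t * t)      # N(1/A)(1 - B/A + (B/A)^2)
```

The ratio of this approximation to the exact quotient is 1 + (B/A)^3, so the relative error stays below 2^-30 for every normalized denominator.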
The third scheme uses Newton's iterative method. The iteration for the function
$$f(x) = Dx - 1$$
will converge to the reciprocal of "D". The derivative $f'(x) = D$, so that
the equation for the iteration is
$$x_{n+1} = x_n - \frac{Dx_n - 1}{D} = x_n + \frac{1}{D}\,(1 - Dx_n),$$
which is identically equal to 1/D. The term 1/D is the sought and unknown
reciprocal. However, $x_n$ is approximately equal to the reciprocal, so that the
iteration becomes
$$x_{n+1} = x_n + x_n (1 - Dx_n).$$
The analytically equivalent form
$$x_{n+1} = 2x_n - Dx_n^2$$
cannot be computed with as much accuracy as can the preceding form with
the given processor.
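The convergence of this iteration can be illustrated with exact rational arithmetic. The starting value below is a hypothetical stand-in for a table lookup: it is simply the true reciprocal rounded to eleven fractional bits, and the function name is likewise illustrative.

```python
from fractions import Fraction

def reciprocal_newton(d: Fraction, iterations: int = 2) -> Fraction:
    """Approximate 1/d for 1/2 <= d < 1 with x <- x + x(1 - d x)."""
    x = Fraction(round((1 / d) * 2**11), 2**11)   # assumed twelve-bit start value
    for _ in range(iterations):
        x = x + x * (1 - d * x)                   # error term is squared each pass
    return x
```

Because 1 - D x_{n+1} = (1 - D x_n)^2, a start value good to about twelve bits gives roughly twenty-four bits after one pass and forty-eight after two, before any rounding in the multiplier is taken into account.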
The denominator "D" whose reciprocal is sought must be normalized
in the usual binary sense; that is, its high order bit must be a one. An
initial twelve bit approximation, $x_0$, is obtained from three Signetics N8228
read only memories by using A(2,10) (see Figure 4.2.5.2.5-1) as address bits;
A(1) is known to be a one. In this scheme, however, the high order part of D
should be rounded by adding $2^{-12}$ after the left shift which guarantees that
the high order bit of D is a one.
Programs were written to simulate all three schemes. In the
iterative case, two iterations were always performed; no convergence test was
done. Therefore, the scheme requires a total of five multiplications to
compute a quotient; two multiplications are needed for each iteration, and a
final multiplication is required to compute the quotient from the reciprocal.
The simulation programs for the second and third schemes accepted four
parameters:
1. the desired numerator,
2. the initial denominator,
3. the increment between successive denominators, and
4. the final denominator.
The programs computed all quotients for the indicated range of denominators.
Two pairs of simulation programs were written. One pair computed quotients
correct to twenty-eight bits and compared the approximate values to them.
The second pair of programs computed a quotient rounded to twenty-four bits
for each denominator, and compared similarly rounded approximate quotients
to them. The results of tests using these programs are given in Table
4.2.5.2.5-1. These results led to the choice to implement the third scheme.
The implementation of the third division scheme uses four processor
registers; registers zero to three are used. The first step in the process
is to move the original denominator to register zero. This is necessary
because one of two tri-state sources supplies the operand to the normalization
shifters. The normal source is the two-to-one selectors in the upper
right corner of Figure 4.2-1. The operand from memory enters the processor
through these selectors. The other source is the zero-to-three bit shift
logic discussed below. A denominator from memory would enter the normalization
shift logic from two sources when a zero to three bit shift is performed
if B(1,32) of Figure 4.2.5.2.5-1 were to come from the memory operand
selectors of Figure 4.2-1. Hence, the B(1,32) operand must come from the
registers. Another implication of this is that the two-to-one selectors
which select between the register and the memory operand in Figure 4.2-1 must
be the tri-state SN74S257 for the fraction part of the operand.
Figure 4.2.5.2.5-1 (diagram not reproduced)
Numerator             100000_16 (i.e. 1/16)
Initial denominator   100000_16
Final denominator     200000_16 (i.e. 2/16)
Increment             1 (i.e. 2^-24)

                           28-bit Quotient               28-bit Rounded Quotients
Item                       Multiplicative   Newton's     Multiplicative   Newton's
                           Method           Method       Method           Method
Sum of Absolute
Values of Errors           7CDA1.E_16       850B2.3_16   7CE84.0_16       800C9.0_16
Average Absolute
Error (rounded)            0.7D_16          0.85_16      0.7D_16          0.80_16
Maximum Absolute
Error                      (illegible)      (illegible)  2.0              1.0
Sum of Signed
Errors                     EDD8.6_16        546D.1_16    -ED2A.0_16       -2AF1.0_16
Average Signed
Error (rounded)            0.0EE_16         0.054_16     -0.0ED_16        -0.02B_16
Table 4.2.5.2.5-1 Results of Tests of the Two Division Algorithms
The second step of the algorithm uses the zero to three bit shift
logic of Figure 4.2.5.2.5-2 to shift the original denominator left by zero to
three bit positions so that the high order bit is a one. Since the logic
assumes that a three bit or smaller shift will suffice for this operation,
the original denominator must be a normalized value. The logic of Figure
4.2.5.2.5-2 relies on the AM25S10 tri-state four bit shifter. The figure
illustrates both a left and a right shifting capability. Each AM25S10
accepts seven input bits, a two bit shift amount, and a tri-state enable
signal. The two bit shift amount determines which of four sets of four
contiguous input bits are output by the device. By using correct overlapping
bit assignments to multiple AM25S10's, operands with more than four bits can
be shifted. Figure 4.2.5.2.5-2 illustrates shift logic for eight bit input
operands; shift logic for thirty-two bit values requires sixteen rather than
four AM25S10's. Whether the ensemble of Figure 4.2.5.2.5-2 shifts left, as
required by the second division step, or right, as required by a later step,
is determined by the logic at the top of the figure. For this step, control
signals from the control unit force a left shift, and cause the division ROM
output to be ignored. The SN74148 of Figure 4.2.5.2.5-1 computes the shift
amount for the zero to three bit shift logic by examining the three high
order bits of the original denominator as stored in processor register zero.
The shifted denominator is stored in processor register one.
Figure 4.2.5.2.5-2 The Zero to Three Position Shift Logic
(diagram not reproduced; labelled elements: AM25S10 shifters with inputs I(1)
through I(8), left outputs OL(1) through OL(8), right outputs OR(1) through
OR(8), left and right shift enables, DSHIFT(1,2), shift direction from the
division ROM)
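The overlapping use of four bit shifters for the left shift of step two can be sketched as follows. Each shifter is modelled as a window that selects four contiguous bits out of seven; the list representation (index zero is the leftmost bit) and the function names are illustrative assumptions, and only the left shifting half of the logic is modelled.

```python
def window4(bits7, shift):
    """Model of one four-bit shifter: pick 4 contiguous bits out of 7."""
    return bits7[shift:shift + 4]

def shift_left_0_to_3(bits, shift):
    """Left shift a bit list (length a multiple of 4) by 0 to 3 places."""
    padded = list(bits) + [0, 0, 0]           # zeros enter from the right
    out = []
    for start in range(0, len(bits), 4):      # one shifter per output group
        out += window4(padded[start:start + 7], shift)
    return out

def binary_shift_amount(bits):
    """Leading zero bits among the three high-order bits (0 to 3)."""
    amount = 0
    for bit in bits[:3]:
        if bit:
            break
        amount += 1
    return amount
```

Adjacent shifters share three input bits, which is the overlapping bit assignment the text refers to; an eight bit operand needs two shifters per direction.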
The third step of the algorithm rounds the shifted denominator value
by adding $2^{-12}$ to it. The constant for this rounding operation comes from
the left operand selector described in section 4.2.5.1.5. Let us call the
original denominator D and the shifted and rounded denominator $\bar{D}$ in the
following discussion. The rounding step which produces $\bar{D}$ can result in an
overflow; the carry out of the adder, ACOUT, is recorded in the C flip-flop of
Figure 4.2.5.1.12.1-1 for later use in the division process. If overflow
occurs during denominator rounding, the special shift of one bit position in
the fraction selector (section 4.2.5.1.7) is used to force the rounded result
to 800000_16, or exactly one-half.
The fourth step of the algorithm uses the division ROM's of
Figure 4.2.5.2.5-1 and $\bar{D}$ to compute $x_0$, the first approximation to the
desired reciprocal. This value varies from FFF000_16 for a $\bar{D}$ value of
one-half, to 800000_16 for the $\bar{D}$ value FFF000_16. The value actually stored
by the ROM's must be a logic complement of the correct, rounded binary value,
since the adder operates on active low data values and the fraction selector
complements to account for this. The value from the ROM's is thus between
one-half and $1 - 2^{-13}$ inclusive; since it represents the reciprocal of
$\bar{D}$, which is between one-half and $1 - 2^{-13}$ inclusive, it can be
represented for the analysis below as $\frac{1}{2}x_0$. The resulting value is
stored in register two.
In step five, we compute $\frac{1}{2} - \frac{1}{2}x_0\bar{D}$ in one step by
using the multiplier to supply the product term and using the left operand
selector to supply the constant $\frac{1}{2}$. The result of this step is
$\frac{1}{2}(1 - x_0\bar{D})$, which is a small value even for the first of the
two iterations. Thus, step six adds the result of step five to itself to scale
it up to the value $1 - x_0\bar{D}$. Register three is used to store both of
these results.
Step seven computes $\frac{1}{2}x_0(1 - x_0\bar{D})$ by using the multiplier with
$\frac{1}{2}x_0$ from register two and $(1 - x_0\bar{D})$ from register three.
Step eight adds $\frac{1}{2}x_0$ from register two to the result of step seven
(from register three), and produces $\frac{1}{2}(x_0 + x_0(1 - x_0\bar{D}))$ or
$\frac{1}{2}x_1$.
Steps nine through twelve repeat steps five through eight, except
that they use $\frac{1}{2}x_1$ instead of $\frac{1}{2}x_0$ throughout. The result
is $\frac{1}{2}x_2$, or, in other words, $\frac{1}{2}$ of the reciprocal of
$\bar{D}$.
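One pass of the half-scaled iteration (steps five through eight) can be written out with exact rational arithmetic. The sketch ignores active low data, register assignment and finite word length; the function name is illustrative.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def half_scaled_iteration(half_x: Fraction, d: Fraction) -> Fraction:
    """Take x/2 and return x_next/2, where x_next = x + x(1 - x d)."""
    r = HALF - half_x * d      # 1/2 - (x/2) d, that is (1 - x d)/2
    r = r + r                  # doubled to give 1 - x d
    p = half_x * r             # (x/2)(1 - x d)
    return half_x + p          # (x + x(1 - x d))/2
```

Calling the function twice, first with x0/2 and then with its own result, corresponds to the two iterations performed by the hardware. Keeping the reciprocal scaled by one-half keeps every intermediate value a proper fraction.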
Step thirteen uses the multiplier and the exponent adder to
compute the result exponent and $\bar{Q} = (N/\bar{D})$. But we seek Q = N/D. The
form of $\bar{Q}$ is x.xxx..., where each "x" represents a bit. Since $\bar{D}$ was
produced by shifting D left, that is by multiplying the original denominator,
the correct Q is the result of a similar shift of $\bar{Q}$. This shift,
conceptualized by a right shift of the binary point, results in a Q with one of
the four following forms:
    x.xxxxx...x      (1)
    xx.xxxx...x      (2)
    xxx.xxx...x      (3)
    xxxx.xx...x      (4)
Since N, the original numerator, is also a floating point fraction, it has
from zero to three leading zero bits. Hence, each of the four forms above
can have from zero to four leading zero bits. Moreover, an overflow in step
three of the division algorithm means that the original denominator, D, was
actually shifted left one less position than an examination of D would
imply; this fact is recorded in the D flip-flop. Table 4.2.5.2.5-2
summarizes these conditions. The upper left part of each table entry indicates
the amount and direction of a zero to three bit shift which is required to
bring the binary point to one of the following positions:
    .xxx...x, or     (5)
    xxxx.xx...       (6)
A left shift can occur when the number of high order zero bits in Q is
greater than or equal to the number of bits to the left of the binary point
in the form which Q takes among the forms (1) through (4) above. The lower
right part of each table entry indicates the exponent alteration which is
necessary to convert $\bar{Q}$ to the proper quotient Q. The exponent correction
is effected by the exponent correction adder. When form (5) results from the
zero to three bit shift, no exponent correction is required. When form (6)
results, the exponent must be reduced by one. When Table 4.2.5.2.5-2 indicates
that a shift of four places is required, this is achieved by a shift of zero
places in the zero to three position shift logic and a shift of one place in
the normalization shift logic. In all other cases, the normalization shift
logic shifts by zero places.
Table 4.2.5.2.5-2 (entries not reproduced; the table is indexed by the number
of leading zeros in Q and the number of leading zeros in D)
Although the original denominator must be normalized, the numerator
N need not be. A product with four (or more) leading zeros will result when
the numerator is not normalized. The quotient is not normalized when the
original numerator is not normalized. The quantity $\bar{Q}$ will also have four
leading zero bits when the reciprocal is nearly $\frac{1}{2}$ and N has its high
order one bit followed by several zero bits. The product of N with the
reciprocal will then produce a non-normalized result, or one with four zeros.
A shift amount value of Rx in Table 4.2.5.2.4-1 means that a shift
right of x bit positions is required. A shift amount of Lx means that a left
shift of x bit positions is required.
4.2.5.2.6 Integers
The integers are represented and manipulated as floating point
numbers in this design. The fractional part of an integer is zero. Logic
is included to truncate the fraction part of an arbitrary floating point
number. The largest integer that can be represented is $2^{24} - 1$. A larger
integer value can be represented by the thirty-two bit fraction of the
processors, but memory can retain only twenty-four bit fractions. The value
to be truncated to an integer goes into the processor logic as the right
operand. Its exponent is used as the address for a Signetics 8204 read only
memory whose output, IFUNC(1,7) of Figure 4.2.5.1.6-2, controls the high order
six SN74S381 arithmetic-logic units of the adder separately, and always forces
ones in the seventh and eighth units. The logic assumes that the operand is
normalized, and forces the correct number of fraction digits to ones
(complemented to zeros by the fraction selection logic). The SN74S381's in the
adder either add the operand fraction to a forced zero operand from the left
operand selector, or they force ones as output. The function for addition
is 011 and that for forcing ones is 111 (see Table 4.2.5.1.3-1). The high
order bit is supplied by the SIG8205, and the two low order bits are
supplied as CUAFUNC(2,3). The eighth output bit of the SIG8205 goes to the
overflow flip-flop logic as INTRUNC, and is a logic one when the operand
value cannot be represented in the six hexadecimal integer digits permitted
in the design.
An exponent value of zero or less will produce an integer value
of zero. An exponent value between one and six inclusive produces an integer
with the corresponding number of potential non-zero hexadecimal digits. An
exponent value of seven or more results in an integer truncation overflow
condition.
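The effect of the exponent on integer truncation can be sketched in Python. The model assumes an eight digit hexadecimal fraction held as a thirty-two bit integer and a plain signed exponent; the function name and the returned overflow flag (standing in for INTRUNC) are illustrative.

```python
def truncate_to_integer(fraction: int, exponent: int):
    """Clear the fraction digits that lie to the right of the radix point.

    The value is fraction / 16**8 * 16**exponent. Returns (fraction, overflow).
    """
    if exponent >= 7:
        return fraction, True                 # more than six integer digits
    keep = max(exponent, 0)                   # digits to the left of the point
    mask = (0xFFFFFFFF << (4 * (8 - keep))) & 0xFFFFFFFF
    return fraction & mask, False
```

With an exponent of three, for example, only the three high order digits survive, so the result still carries its original exponent and remains a floating point value.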
4.2.5.2.7 Double Precision Addition and Subtraction
Measurements of the current model's execution on the IBM/360
revealed that little double precision operation is required. Therefore, we have
designed a single precision processor which is augmented with the minimum
additional hardware needed to permit double precision calculations. Twenty
processor cycles are required to perform a normalized double precision addition
or subtraction. A double precision value consists of two single precision
values, each with its own correct exponent and fraction. The high order
part must always be normalized; the low order part contains the least
significant six of the twelve fraction digits, whatever they may be, and
therefore has a normalized form only by coincidence. However, if the high
order fraction is zero, the low order fraction must also be zero. The signs
of both parts must agree.
Implementation of double precision addition and subtraction uses
six processor registers. The normalized result is left with the high order
part in register zero and the low order part in register one. Intermediate
double precision operands in the processor have fourteen fraction digits,
six in the high order part and eight in the low order part. The two low
order digits of the high order parts are always zero at the completion of
an operation. The subset of the processor logic which performs double
precision addition and subtraction is shown in Figure 4.2.5.2.7-1. The logic
relies on the double precision read only memory of this figure for much of
the specialized control which is required.
Figure 4.2.5.2.7-1 (diagram not reproduced)
Several of the steps in the double precision addition and subtrac-
tion process are really fixed point addition of two fractions without regard
to their signs or exponents. The exponent correction adder permits control
from the control unit of which exponent is assigned to a result. The selec-
tion of the sign is also subject to complete control by the control unit.
Hence, the result of a fixed point addition of two fractions can be assigned
the exponent of either fraction and the sign of either fraction.
The complete double precision addition and subtraction process is
illustrated by Figure 4.2.5.2.7-2, Figure 4.2.5.2.7-5, Figure 4.2.5.2.7-9,
and Figure 4.2.5.2.7-10. In these figures, the exponents and individual
digits of all operands are shown. The digits of the two original operands,
X and Y, are denoted by X1 through X14 and Y1 through Y14 respectively. The
process determines which of the two operands is the larger and which is the
smaller. The digits of the larger are denoted by L1 through L14; the digits
of the smaller are denoted by S1 through S14. Finally, the digits of the sum
or difference are denoted by T1 through T14. The operation portrayed by the
figures is:
    T = X + Y.
The original operands are shown in Figure 4.2.5.2.7-2(a). In the first step
of the process, the high order part of X is written into registers zero and
one; the operand registers permit writing a value to two different registers
in one operation (see section 4.2.5.1.10).
Figure 4.2.5.2.7-2 Preparatory Double Precision Addition
and Subtraction Steps
(diagram not reproduced; parts (a) through (d) show the exponents and digits
of the operands X and Y, of the larger operand L, and of the smaller operand S)
In the second step, the two high order parts of the operands are
passed through the processor logic. The X operand is the left operand and
the Y operand is the right operand since it may come from memory. The Y
operand is always passed through the adder and fraction selector. The S
logic, shown in Figure 4.2.5.2.7-3, determines whether the Y operand is
larger or smaller than the X operand. A zero operand is always regarded as
the smaller regardless of its exponent value. If the Y operand is larger,
the S signal is zero; if the Y operand is smaller, the S signal is one. The
result of the comparison, the S signal, is stored in the S flip-flop of Figure
4.2.5.2.7-3 for use in routing the low order halves of the operands in a
later step. Table 4.2.5.2.7-1 explains the input signals for the S logic,
and Table 4.2.5.2.7-2 gives the truth table for the S logic.
The logic which varies the operand register address bits to accomplish
the local control needed by this and other steps in the double precision
addition and subtraction process is shown in Figure 4.2.5.2.7-4. The S
signal is used, together with three zero address bits from the control unit,
to select either register zero or register one during this step. The net
result of step two is shown in part (c) of Figure 4.2.5.2.7-2; the larger
operand is stored in register zero and the smaller in register one.
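The comparison made in step two can be sketched in Python as follows. This is
only an illustration of the rule stated above, not a model of the S logic
itself. The function name and the (exponent, fraction) pairs are hypothetical,
both operands are assumed to be normalized, and the treatment of two equal
operands (reported here as S = 0) is an assumption, since the text above does
not fix that case.

```python
def s_signal(x, y):
    """Return 1 if the Y operand is the smaller, 0 if it is the larger.

    x and y are (exponent, fraction) pairs of non-negative integers.
    Signs are ignored, as in the magnitude comparison of step two.
    """
    x_exp, x_frac = x
    y_exp, y_frac = y
    # A zero operand is always the smaller, whatever its exponent.
    if y_frac == 0 and x_frac != 0:
        return 1
    if x_frac == 0:
        return 0  # assumption: Y counts as the larger when X is zero
    # Otherwise compare exponents first, then fractions.
    if (y_exp, y_frac) < (x_exp, x_frac):
        return 1
    return 0  # Y is larger, or the operands are equal (assumed)
```

With S in hand, the larger operand goes to register zero and the smaller to
register one, which is the state shown in part (c) of the figure.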
Step three duplicates the smaller operand in registers four and
five. The Z flip-flop is set to indicate whether the smaller operand is
zero. The result of this step is shown in part (d) of Figure 4.2.5.2.7-2.
The next five steps align the fractions of the operands in preparation
for the addition or subtraction steps. These five steps are shown in
Signal    Significance
AZERO     The left operand fraction is zero.
BZERO     The right operand fraction is zero.
ABEXEQ    The operand exponents are equal.
EXC2      The left exponent is greater than or equal to the right exponent.
AGTR      The left fraction is greater than the right fraction.
Table 4.2.5.2.6-1 The Significance of the S Logic Input Signals
Columns: Signals (AZERO or BZERO, ABEXEQ, EXC2, AGTR); SN74S150 Input;
Comments.
Comments column entries, in order: Y greater than or equal to X; X equals Y;
X greater than or equal to Y; Y greater than X; Exactly one operand is zero.
Table 4.2.5.2.7-2 The Truth Table for the SN74S150 of the S Logic
Figure 4.2.5.2.7-3 The Logic for the S Signal (labels: SN74S150, SN74S157,
SN74S74, SN74S86, SFF, SFFBAR, CUSUBIN, CUSUB)
Figure 4.2.5.2.7-5. The figure covers two cases. In the left column are
successive register states for the case when the exponent difference is less
than six; the right column covers the case where the exponent difference is
greater than or equal to six. The exponent difference illustrated by the
left column is three; that for the right is seven. The double precision ROM,
which is crucial to many of the following steps, is shown in detail in
Figure 4.2.5.2.7-6. It can be implemented with a Signetics 8204 read only
memory. This ROM stores 256 eight bit words. The eight bit address is used
as shown in the figure. One control signal from the control unit determines
whether an alignment or normalization shift control result is desired;
another control signal specifies whether a left shift or right shift is
required. The other bits contribute to determining the shift amount. The
operand which is to be shifted is always known beforehand, and is sent through
the logic as the right operand. Table 4.2.5.2.7-3 summarizes the functions
performed by the double precision control ROM during the operand alignment
phase. The symbol "d" in the table represents the exponent difference.
Step four performs a left shift of the smaller operand by the amount
given in Table 4.2.5.2.7-3. The control ROM uses signals DCADDR(1) and
DCADDR(3) as shown in Figure 4.2.5.2.7-4 to store the result in register four
when the exponent difference is less than six and in register one when that
difference is greater than or equal to six. The results of step four are
shown in Figure 4.2.5.2.7-5(a).
Step five performs a right shift of the smaller operand, taken
from register five, by the amount given in Table 4.2.5.2.7-3. The control
ROM again uses DCADDR(1) and DCADDR(3) as shown in Figure 4.2.5.2.7-4 to
Figure 4.2.5.2.7-4 Logic for Local Control of
Operand Register Addresses (labels: S, DCADDR(1), DCADDR(3), CUCADDR(1),
CUCADDR(3), CADDRESS(1), CADDRESS(3), CUBADDR(3), BADDRESS(3), SN74H52)
Figure 4.2.5.2.7-5 Alignment Steps in Double Precision Addition and
Subtraction (parts (a) through (f); left column: exponent difference less
than six; right column: exponent difference greater than or equal to six)
Figure 4.2.5.2.7-6 (the double precision control ROM)
Shift        Shift Amount
Direction    d < 6     d >= 6
Left         6-d       d
Right        d         d-6
Table 4.2.5.2.6-3 Signetics 8205 Control ROM Shift Amount
During the Operand Alignment Phase
store the result in register one when the exponent difference is less than
six and in register four when that difference is greater than or equal to
six. This shifted result must have its two low order digits both zero. This
is necessary for step eleven to compute a correct high order part. The two
low order digits, FRACT(25,8), are forced to zero by causing the two SIG8263
selectors of the fraction selection logic (Figure 4.2.5.1.7-1) which produce
these bits to emit zeros during this step. This is accomplished by setting
both bits of their selection signal to zero and their complement signal also
to zero (see Table 4.2.4.2-1). The results of this step are shown in Figure
4.2.5.2.7-5(b).
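The effect of steps four and five on the high order part of the smaller
operand can be sketched as follows. This is a minimal illustration under
stated assumptions: an operand register fraction is modeled as a Python list
of eight hexadecimal digits (six fraction digits followed by two zero digits),
shifts fill with zeros, and all function names are hypothetical. The shift
amounts are those of the alignment phase table, with d the exponent
difference.

```python
def alignment_shifts(d):
    """Return (left, right) shift amounts in digits for exponent difference d."""
    if d < 0:
        raise ValueError("d must be non-negative")
    if d < 6:
        return 6 - d, d
    return d, d - 6

def shift_left(digits, n):
    """Shift a digit list left by n places, filling with zeros."""
    n = min(n, len(digits))
    return digits[n:] + [0] * n

def shift_right(digits, n):
    """Shift a digit list right by n places, filling with zeros."""
    n = min(n, len(digits))
    return [0] * n + digits[:len(digits) - n]

def force_low_two_zero(digits):
    """Model the forcing of the two low order digits to zero in step five."""
    return digits[:-2] + [0, 0]

# High order part of the smaller operand: S1..S6 followed by two zero digits.
small_high = [1, 2, 3, 4, 5, 6, 0, 0]
left, right = alignment_shifts(3)
spill = shift_left(small_high, left)          # digits destined for the low part
aligned = force_low_two_zero(shift_right(small_high, right))
```

For an exponent difference of three, `spill` holds S4 S5 S6 at the left of the
register and `aligned` holds S1 S2 S3 after three leading zeros, which matches
the left column case described above.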
Step six loads registers two and three with the low order part of
X. Step seven is similar to step two. The contents of the S flip-flop, as
shown in Figure 4.2.5.2.7-4, are used to direct the low part of Y to register
two when Y was the larger operand in step two, and to register three when Y
was the smaller operand in step two. The state of the registers after step
seven is shown in Figure 4.2.5.2.7-5(d).
In step eight, a normal floating point alignment operation results
in a shift right of the smaller low order part, taken from and returned to
register three, by the amount of the exponent difference. The result of this
step is shown in Figure 4.2.5.2.7-5(e). Of course, when the exponent
difference exceeds seven, the contents of register three after this step are
zero. Step eight combines the contents of registers three and four by
addition with forced alignment shifts of zero places to produce the correct low
operand for the addition or subtraction step. The result of this step is
shown in Figure 4.2.5.2.7-5(f). At this point, the two high order operands
are in registers zero and one, and the two low order operands are in
registers two and three.
The actual addition or subtraction process is complicated by the
fact that sign-magnitude representation is used for floating point values in
this design. The actual operation which must be performed depends not only
on the instruction being executed but also on the signs and relative mag-
nitudes of the operands being processed. If one of the operands is zero,
the result is the other operand, possibly with its sign reversed. If two
operands with equal exponents are to be added, the actual operation performed
depends on their signs. When the signs are the same, the magnitudes are
simply added, and the sign of the result is that shared by the two operands.
However, when the signs differ, the smaller magnitude must be subtracted
from the larger, and the sign of the result is that of the larger operand.
During double precision addition and subtraction, the function which the
adder must perform is usually determined by the high order parts of the
operands. But, for example, when the signs are unlike during an addition,
the relative magnitudes of the low order parts of the operands will
determine the operation when the high order parts are equal. In step nine, the
D flip-flop of Figure 4.2.5.2.7-7 is set according to the truth table in
Figure 4.2.5.2.7-7 The D Flip-flop Logic (labels: ABEXEQ, ABEQ, SN74S04,
SN74S11, SN74S74, D, DBAR)
Table 4.2.5.2.7-3. For this step, the two high order parts are passed
through the logic, and the adder function which they require is determined
by the Signetics 8205 read only memory of Figure 4.2.5.2.7-8. The adder
function is stored in the NAT8551 tri-state register, but the result of the
operation is not stored in the operand registers. The D flip-flop is set
to a logic zero when the high order parts of the operands determine the
function; the D flip-flop is set to one only when both the high order exponents
and fractions are equal, so that the low order parts must determine the
function. The operand registers at the end of step nine are the same as they
were before this step. However, the D flip-flop and the NAT8551 are set
by the step for use in step ten.
Columns: Input Signals (ABEXEQ, ABEQ); D Flip-flop Setting; Comments.
Comments column entries, in order: Operands not equal; The operands are
equal; Operands not equal; Operands not equal.
Table 4.2.5.2.7-3 Truth Table for the D Flip-flop
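The way the sign-magnitude rules and the D flip-flop together select the
adder operation can be sketched as follows. This is an illustrative model
only, not the ROM contents: operands are assumed to be already aligned to a
common exponent, each is a hypothetical (sign, high, low) triple of
non-negative integers, `LOW_BITS` is an assumed width for the low order part,
and the sign given to an exactly zero difference is an assumption.

```python
LOW_BITS = 32  # assumed width of the low order part, for illustration only

def sign_magnitude_add(x, y, subtract=False):
    """Add (or subtract) two aligned sign-magnitude values.

    x and y are (sign, high, low) triples; sign is 0 for plus, 1 for minus.
    Returns (sign, magnitude) of the result.
    """
    x_sign, x_high, x_low = x
    y_sign, y_high, y_low = y
    if subtract:
        y_sign ^= 1  # subtraction is addition of the operand with reversed sign
    # D is one only when the high order parts are equal, so that the
    # low order parts must decide which operand is the larger.
    d = 1 if x_high == y_high else 0
    if d:
        x_larger = x_low >= y_low
    else:
        x_larger = x_high > y_high
    x_mag = (x_high << LOW_BITS) | x_low
    y_mag = (y_high << LOW_BITS) | y_low
    if x_sign == y_sign:
        return x_sign, x_mag + y_mag   # like signs: add the magnitudes
    if x_larger:
        return x_sign, x_mag - y_mag   # unlike signs: larger minus smaller
    return y_sign, y_mag - x_mag
```

The sketch computes the whole result at once; in the design the same decision
is split between step nine (high order parts and the D flip-flop) and step
ten (low order parts).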
In step ten, the low order parts of the operands from registers
three and four are added or subtracted using the contents of the NAT8551
when the D flip-flop setting from step nine is zero and using the output of the
SIG8205 control ROM when the D flip-flop setting from step nine is one.
When the relation of the low order operands should determine the adder
function (that is, when the D flip-flop is one), the SIG8205 function output is
clocked into the NAT8551 during step ten processing. The high order carry
out of the adder during step ten is saved in the carry flip-flop, C. This
Figure 4.2.5.2.7-8 (the Signetics 8205 read only memory which determines the
adder function)
carry must be propagated to the high order operation, which occurs in step
eleven. The results of step ten are shown in Figure 4.2.5.2.7-9(a). The
low order result is stored in register three. The normal operation of the
fraction selection logic is aborted for this step; no right shift is
performed if a fraction overflow occurs. Instead, the carry flip-flop contents
propagate the overflow condition to the high order operation.
Step eleven uses the function stored in the SN74S670 and the carry
stored in the carry flip-flop, C, to compute the high order part of the
result. So that the carry can propagate across the eight low order bits
which are ones in both operands (active low zeros), the two low order SN74S157
quadruple two-to-one selectors which select the output of the wire AND shown
in Figure 4.2.5.2.7-1 are made to supply zeros (active low ones) by setting
their strobe inputs to one for this step only. The result of this operation
is shown in Figure 4.2.5.2.7-9(b). The left part of the figure shows the
case for which no fraction overflow occurs; the right part shows the result
when fraction overflow does occur. The high order part of the result is
left in register zero and the low order part in register two by this step.
The one bits introduced to propagate the carry must be removed by
the fraction selection logic. The two SIG8263 three-to-one selectors which
forced the two low order digits to zero in step five are used. They operate
under processor control to force two digits to zero when no fraction
overflow occurs, and they force one digit to zero when a fraction overflow does
occur.
Step twelve shifts the high order part of the result left six
Figure 4.2.5.2.7-9 The Addition Steps in Double Precision
Addition and Subtraction (parts (a) through (e))
places and stores the shifted value in register one. The control ROM will
output the value six required to control the shift if the register zero
operand is sent through the logic as both the left and right operands. One
of the operands is forced to zero by its alignment shift logic, and the
other, shifted six left, passes through to register one. The results of step
twelve are shown in Figure 4.2.5.2.7-9(c).
Step thirteen is an ordinary unnormalized addition of the contents
of registers one and two. The result is stored in register one, and it is
the correct low order part for the double precision operation. Steps twelve
and thirteen served to transfer a possible T7 digit from the high to the low
order part of the double precision fraction. The results of step thirteen
are shown in Figure 4.2.5.2.7-9(d). The zero flip-flop is set to indicate
whether the high order fraction result of this step is zero.
In step fourteen, the high order part is passed through the logic
and two low order zero digits are forced by the fraction selection logic to
clear a possible T7 digit from the high order part of the result. The
results of step fourteen, a correct but unnormalized double precision
floating point addition or subtraction result, are shown in Figure
4.2.5.2.7-9(e).
The result must be normalized. If the high order fraction is zero
but the low order one is not, the logic which controls the adder function
selection for double precision operations will not work correctly. The five
steps which are required to normalize the result are shown in Figure
4.2.5.2.7-10. The left column of the figure deals with the case in which
the high order fraction is zero; the right column treats the case in which
the high order fraction is not zero.
Figure 4.2.5.2.7-10 The Normalization Steps in Double Precision
Addition and Subtraction (parts (a) through (e))
The first step of the normalization process uses the Z flip-flop
state and the logic of Figure 4.2.5.2.7-4 to select the register zero operand
when the high order fraction is non-zero and the register one operand when
the high order fraction is zero. The initial operands for normalization,
assumed results of the addition or subtraction, are shown in Figure
4.2.5.2.7-10(a). The results of this step, an ordinary normalization step,
are shown in Figure 4.2.5.2.7-10(b).
The second normalization step uses the values from register zero
and register one. The exponent difference is used by the control ROM in the
normalization mode to compute a right shift amount. Table 4.2.5.2.7-4
summarizes the function of the SIG8205 control ROM for the normalization phase
of double precision operations. The symbol "d" in the table represents the
exponent difference between the register zero and register one operands.
Shift        High Order Fraction
Direction    Zero      Not Zero
Left         6+d       6-d
Right        6-d       d
Table 4.2.5.2.7-4 Signetics 8205 Control ROM
Shift Amount During the
Normalization Phase
The second normalization step shifts the low order fraction right
by the amount specified by the SIG8205 control ROM. The two low order digits
of the shifted result are forced to zero by the FRACT(25,8) selectors of the
fraction selection logic. The results of this step are shown in Figure
4.2.5.2.7-10(c). The shifted result is stored in register three.
The third normalization step adds the contents of registers three
and zero and stores the result in register zero. The result of this step is
shown in Figure 4.2.5.2.7-10(d). The net effect of steps two and three is
the transfer of fraction digits from the low to the high order part of the
double precision fraction.
The fourth normalization step shifts the low order fraction left
by the amount specified by the SIG8205 control ROM. The shift amount computed
by the ROM is subtracted from the exponent of the low order operand so that
the final exponent result is correct. The amount subtracted from the exponent
is thirteen for the case when only one non-zero fraction digit is produced as
digit 14 of the addition or subtraction result. Thus, although the
normalization shifter is disabled so that it outputs a zero when the shift amount
exceeds seven, an amount of up to thirteen must be able to go from the SIG8205
to the exponent adder. The result of this step is a correct normalized
double precision addition or subtraction result. The zero flip-flop is set
on this step to indicate whether the low order fraction is zero.
The last normalization step tests the high order fraction for zero,
and ANDs the result of the test into the zero flip-flop (see Figure
4.2.5.2.12-1). Hence, the flip-flop will be zero after a floating point
double precision addition or subtraction only if both fraction parts are zero.
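The net effect of the five normalization steps can be sketched as a single
operation on the whole fourteen digit fraction. This is a simplification for
illustration only: the hardware works on two registers in five steps, whereas
the hypothetical function below treats the fraction as one Python list of
hexadecimal digits. Leaving the exponent unchanged for a zero fraction is an
assumption.

```python
def normalize_double(exponent, digits):
    """Normalize a fourteen digit hexadecimal fraction.

    digits[0] is the most significant digit.
    Returns (exponent, digits, is_zero).
    """
    if len(digits) != 14:
        raise ValueError("expected 14 hexadecimal digits")
    if all(digit == 0 for digit in digits):
        return exponent, list(digits), True
    # Count the leading zero digits; this is the left shift amount, and it
    # is subtracted from the exponent so the value is unchanged.
    shift = next(i for i, digit in enumerate(digits) if digit != 0)
    shifted = list(digits[shift:]) + [0] * shift
    return exponent - shift, shifted, False
```

When only digit 14 is non-zero the shift is thirteen places, which is the
largest amount the exponent adder must accept according to the text above.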
4.2.5.2.8 Double Precision Multiplication
Figure 4.2.5.2.8-1 shows the partial products which contribute to
a double precision multiplication result. In this design, two double
precision operands are multiplied to yield a double precision result. The low
order part of that result is not produced. The figure displays the product
Figure 4.2.5.2.8-1 The Partial Products in Double Precision
Multiplication (operands A = (A1, A0) and B = (B1, B0))
of A=(A1, A0) by B=(B1, B0); A1 and A0 are the most and least significant
parts of the double precision number A, respectively. The products A1*B0
and A0*B1 are computed first; four registers store the product results. They
are combined into two values by addition of the low order parts and
propagation of the carry to the addition of the high order parts. The carry from
the high order addition is saved for later addition to the high order part
of the product A1*B1. The product A1*B1 is computed and the saved carry is
added to the high order part. The high order part of the sum of the middle
partial products is then added to the product A1*B1. The carry is propagated
across. Finally, the product A0*B0 is computed. It is added to the low
order part of the sum of the middle partial products, and the carry, if any,
is propagated by two additions.
Twenty steps are needed to complete the process. They are:
1. Multiply: Compute A1*B0 and store the high order part in register one.
The low order exponent of the final product is computed in
this step.
2. Store: Store the low order part of the product in register two.
3. Multiply: Compute A0*B1 and store the high order part in register zero.
4. Store: Store the low order part in register three. The addition
with the low order part of A1*B0 which follows cannot be
done on the fly because the operands for the multiplication
must continue to be supplied by the operand registers.
5. Add: Add the low order parts of the above products and save the
carry. Store the result in register two.
6. Add with carry: Add the high order parts of the above products together
with the saved carry from the low order parts. Store the
result in register one. Save the carry from this addition.
7. Multiply: Compute A1*B1 and store the high order part in register
zero. The high order exponent of the final product is
computed in this step.
8. Store: Store the low order part of the A1*B1 product in register
three.
9. Add carry: Add the carry saved from the previous addition to the high
order part of the A1*B1 product.
10. Add: Add the contents of register one to the low order part of
the A1*B1 product from register three. Save the carry out
of this addition.
11. Add carry: Add the saved carry from step (10) to the high order part
of the product in register zero.
12. Multiply: Compute A0*B0 and store the high order part in register
three.
13. Add: Add the high order part of A0*B0 to the low order part of
the sum of the middle partial products. Save the carry
from this addition.
14. Add carry: Add the saved carry to the low order part of the final
result in register one. Save the carry from this addition.
15. Add carry: Add the saved carry to the high order part of the final
result in register zero.
The result of the above fifteen steps is the unnormalized double
precision product of the initial double precision operands. Five normaliza-
tion steps exactly like those which were used to normalize the double preci-
sion addition or subtraction result complete the operation.
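The fifteen fraction steps can be sketched with integer arithmetic as follows.
This is a simplified model under stated assumptions, not the processor
sequence itself: each half of an operand is taken to be an unsigned integer of
`W` bits (an assumed, equal width for both halves), exponents, signs and the
five normalization steps are omitted, and the function names are hypothetical.
The step numbers in the comments refer to the list above.

```python
W = 32                 # assumed width of each half, in bits
MASK = (1 << W) - 1

def split(value):
    """Split a value into (high part or carry, low W bits)."""
    return value >> W, value & MASK

def double_precision_product(a1, a0, b1, b0):
    """Return the upper two W-bit words of (a1, a0) * (b1, b0)."""
    # Steps 1-4: the two middle partial products.
    h10, l10 = split(a1 * b0)
    h01, l01 = split(a0 * b1)
    # Steps 5-6: add them, low parts first, saving each carry.
    c_low, mid_low = split(l10 + l01)
    c_mid, mid_high = split(h10 + h01 + c_low)
    # Steps 7-11: bring in a1*b1 and the high part of the middle sum.
    high, low = split(a1 * b1)
    high += c_mid
    carry, low = split(low + mid_high)
    high += carry
    # Steps 12-15: a0*b0 contributes only through its high order part.
    h00, _ = split(a0 * b0)
    carry, _ = split(mid_low + h00)
    carry, low = split(low + carry)
    high += carry
    return high, low
```

Because the discarded low order part of A0*B0 is smaller than one unit of the
word above it, dropping it does not change the two words returned; the sketch
therefore agrees with the upper half of the exact product.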
4.2.5.2.9 Double Precision Division
Double precision division can be implemented by a process which
parallels that for single precision division described in section 4.2.5.2.5.
The initial approximation to the reciprocal is computed by a single precision
division. An iterative procedure based on the equation
$$x_{n+1} = x_n + x_n (1 - D x_n)$$
is carried out. We did not determine the number of iterations which would be
required, but it would be two, perhaps three. The term "D" above is the
original double precision denominator, and the successive x terms are
approximations to the reciprocal. Double precision multiplications are used to
perform the iterations, and fixed point double length additions combine the terms
as they did in the single precision division case. A final floating point
multiplication by the original numerator completes the computation of the
required quotient.
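The convergence of this iteration can be checked with a short sketch. It is
an illustration only, not the double precision hardware sequence: exact
rational arithmetic from Python's `fractions` module stands in for the
machine arithmetic, the denominator 3/7 is an arbitrary example, and the
initial approximation rounded to 21 fraction bits is a hypothetical stand-in
for the result of the single precision division.

```python
from fractions import Fraction

def refine_reciprocal(d, x0, iterations):
    """Apply x <- x + x*(1 - d*x) the given number of times."""
    x = x0
    for _ in range(iterations):
        x = x + x * (1 - d * x)
    return x

# Hypothetical example: D = 3/7, with an initial reciprocal rounded to
# 21 fraction bits standing in for the single precision division.
d = Fraction(3, 7)
x0 = Fraction(round((1 / d) * 2**21), 2**21)

# Residual |1 - D*x| after zero, one and two iterations.
residuals = [abs(1 - d * refine_reciprocal(d, x0, n)) for n in range(3)]
```

Since 1 - D x_{n+1} = (1 - D x_n)^2, the residual is squared by each
iteration, so the number of correct bits roughly doubles each time. This is
consistent with the estimate of two iterations for a fourteen digit fraction.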
4.2.5.2.10 Multiplication and Division by a Power of Two
In many of the multiplications and divisions which the model
executes, one of the operands is a power of two. The logic described in this
section performs a multiplication or division by a power of two in one processor
cycle. The power of two in the operation is specified by a six bit value,
CSHIFT(1,6) of Figure 4.2.5.2.10-1. In a machine with an exponent radix of
two, all of these bits would be added to the exponent for multiplication by
a power of two and subtracted from it for division by a power of two. In
this design, however, the exponent radix is sixteen. Thus, the two low order
bits of the power of two determine a shift of the fraction, and the four high
order bits of the power of two are added to or subtracted from the exponent.
The control aspects of the logic are shown in Figure 4.2.5.2.10-1. The heart
of the process is the Signetics 8204 read only memory. It accepts
CSHIFT(5,2), the two low order bits of the power of two, the three high order bits
of the fraction, and a signal which specifies whether multiplication or
division by a power of two is desired. The output from the read only memory
controls the zero-to-three position shifter with a two bit amount and a one
bit shift direction signal, and it controls the exponent correction adder with
Figure 4.2.5.2.10-1 (control logic for multiplication and division by a
power of two)
High Order Fraction Zero Bits    Shift
Table 4.2.5.2.10-1 Control Details for Multiplication by a
Power of Two
Table 4.2.5.2.10-2 Control Details for Division by a Power
of Two
a one bit function signal and a one bit selection signal.
Table 4.2.5.2.10-1 gives the details of the control signals for
multiplication by a power of two, and Table 4.2.5.2.10-2 gives the details
for division by a power of two. The upper left part of each table entry
gives the shift amount and direction; the lower left part gives the exponent
adjustment.
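The split of the power of two between a fraction shift and an exponent
adjustment can be sketched as follows. The read only memory contents are not
given here, so the rule below is an assumption: the fraction is shifted left
by the two low order bits of the power when it has enough leading zero bits,
and otherwise shifted right with one more added to the exponent. The fraction
width of 24 bits and the function name are also assumptions, and a right
shift simply drops low order bits.

```python
FRACTION_BITS = 24   # assumed fraction width (six hexadecimal digits)

def scale_by_power_of_two(exponent, fraction, power):
    """Scale fraction * 16**exponent by 2**power (power may be negative).

    fraction is a normalized non-negative integer of FRACTION_BITS bits,
    so it has at most three leading zero bits unless it is zero.
    """
    if fraction == 0:
        return exponent, fraction
    leading_zero_bits = FRACTION_BITS - fraction.bit_length()
    # power = 4*quotient + remainder, with remainder in 0..3
    quotient, remainder = divmod(power, 4)
    if remainder <= leading_zero_bits:
        return exponent + quotient, fraction << remainder
    # Not enough room on the left: shift right and use the next exponent.
    return exponent + quotient + 1, fraction >> (4 - remainder)
```

Python's `divmod` rounds the quotient downward, so a negative power (a
division) is handled by the same two cases; the hardware instead receives a
separate multiply or divide signal.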
4.2.6 The Instruction Set for the Processors
The instruction set for the processors is given in Table 4.2.6-1.
Separate classes of instructions with three, two, one and zero addresses are
included. An address usually designates a processor register or memory
location, but no more than one memory address is permitted in an instruction.
In some special cases noted in Table 4.2.6-1, an address designates an
operand other than a processor register or a memory location.
The first four operations in the table - addition and subtraction,
multiplication and division - were covered in detail in sections 4.2.5.2.3,
4.2.5.2.4, and 4.2.5.2.5 respectively. The AND, OR and XOR (exclusive or)
logical operations are implemented by using the corresponding logical
operation of the SN74S381 arithmetic-logic unit of the adder (see Table
4.2.5.1.3-1). Logical NOT is implemented by using an exclusive OR with a
forced one operand from a disabled alignment shift network. The MOVE
operations are simple transmissions of operands from one place to another.
Normalization is discussed in section 4.2.5.2.1; the integerize operation
is discussed in section 4.2.5.2.6. Comparison operations are simply
subtractions which set the condition flip-flops, but not the operand
registers. The mode setting instructions use the mode logic of section
4.2.5.1.9. Combinations of sequences of
Column headings: Addresses, Operation, Options, Comments.
Addresses: 3; 3; 3; 2; 1; 1.
Operation: Add; Subtract; Multiply; Divide; Shift; Logical AND; Logical OR;
Logical XOR; Move; Compare; Normalize; Integerize; Logical NOT; Round; Set;
Move; Set Mode; Route; Set Mode; Table-look-up.
Options: Round, Normalize, Sign (three entries); Sign; Normalize; Exponent
source (three entries); Register <- Memory; Memory <- Register; Routing
pattern <- Register; Register <- Register; Sign; Normalize, sign; Sign;
Status(i) <- Mode @ Status(j); Mode, Status(i) <- Mode @ Status(j);
Register <- Routing data; Routing data <- Register; Routing data <- Memory;
Register <- Status; Status <- Register; Register <-; Mode <- M; Mode <- 1;
CU <- Modes; Mode <-; Addresses pattern.
Comments: Single & double precision (entries for Add, Subtract, Multiply and
Divide, and several later entries, some with Sign); Multiply by a power of
two; Set the condition register; The "@" sign represents any one of the
sixteen possible Boolean operations on two variables. The two addresses
designate the bit numbers "i" and "j" which select among the eight status
register bits.
Table 4.2.6-1 The Instruction Set for the Processors
in the Array
condition states can be stored in the status register of the mode logic, and
provide a simple way to implement complex testing procedures. Several
instructions include the option to require a particular sign for the result. With a
sign-magnitude representation, absolute value and complementation operations
reduce to simple sign manipulations. The sign logic of section 4.2.5.2.12.3
permits the normal result sign, its complement, a positive sign, a negative
sign, or the exclusive OR of the operand signs to be assigned as the sign of
the result.
The route instruction supplies a routing pattern address to the
routing network. The network stores sixteen pre-loaded routing patterns.
A routing instruction calls for the use of one of these pre-loaded patterns.
A built-in operand broadcast is also included. It causes an operand in one
of the 256 routing dis-assembly registers to be sent to every routing
re-assembly register. The control unit can load values into the original
dis-assembly register and retrieve values from the corresponding re-assembly
register. See section 4.3 for the details of the routing network.
The shift operation permits multiplication or division by a power
of two as discussed in section 4.2.5.2.10. The power of two is a control
unit operand six bits in length.
The exponent selection feature of the logical operations permits a
mask to be used for both selecting bits from a fraction and assigning an
exponent value from the mask word to the result. The final binary point
alignment can be achieved by a shift operation.
4.3 Processor Intercommunication - The Routing Network
In virtually every problem for which an array processor is suited,
the processors in the array need to exchange data values from time to time.
Indeed, the scope of the problems for which a particular array processor is
suited can depend on the flexibility of its data interchange network. The
data interchange network of this design, hereafter called the routing
network, is a three stage Clos network (Clos, 1953; Benes, 1965). Although
Clos proved that such a network can perform any permutation of the input
signals to the output ports, his proof did not provide a guide to a general
algorithm for controlling the network. This author is among a growing group
of people who would like to have such an algorithm.
The general form for a Clos network is shown in Figure 4.3-1, and
the specific form used in this design is shown in Figure 4.3-2. The author
is indebted to William Stenzel for many of the ideas which led to the
formulation of the routing network in this form.
The last two stages of a Clos network form what Lawrie (1973) has
called an omega network. In his thesis, Lawrie shows that an omega network,
among other operations, can perform uniform circular shifts of arbitrary
distance and direction. In later work, Lawrie and Wen (1975) have discovered
simple control algorithms for the omega network which permit its use in
partitioned form to perform several simultaneous circular shifts of indepen-
dent amount and direction within the separate partitions. For an omega net-
work such as we have in this design, the size of all partitions must be an
integer power of two, although the partitions may have various sizes. What
must hold for each partition, however, is that with the input ports numbered
Figure 4.3-1 (the general form of a Clos network)
Figure 4.3-2 (the specific form of the Clos network used in this design)
from zero to N-l, the index number of the lowest numbered input port of a
partition must he congruent to zero modulo the size of the partition. The
Clos network, of course, permits arbitrary partitions, hut we have only been
able to find an algorithm for uniform shifts of one in either direction
within arbitrary partitions. Where other shift amounts are necessary, one
must either conform to the partition restrictions of the omega network and
use the Clos network as an omega network by sending the input operands
straight through the first stage of crossbars without interchange, or make
multiple passes through the general Clos network if non-omega suited parti-
tions must be used.
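The two omega network restrictions on a partition (a power-of-two size, and a lowest port index congruent to zero modulo that size) can be checked mechanically. The following is a minimal sketch; the function name and argument names are illustrative assumptions and do not come from the design.

```python
def is_omega_partition(start, size):
    """Return True if a partition of `size` ports whose lowest numbered
    input port is `start` satisfies both omega network restrictions."""
    # Restriction 1: the partition size is an integer power of two.
    power_of_two = size > 0 and (size & (size - 1)) == 0
    # Restriction 2: the lowest port index is congruent to zero modulo the size.
    return power_of_two and start % size == 0
```

A partition that fails this check would have to be handled by the general Clos network, in multiple passes if shifts other than one are needed.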
The details of the interconnections between the crossbars in the
Clos network are given in Figure 4.3-3 for a two stage network of four by
four crossbars. The figure shows the sixteen input ports of the network
divided into four groups of four. The destination number, d, of a lead
from an output port source of the first stage, s, is given by
d = (s*N + g) modulo N^k ,
where all port numbers begin at zero, g is a crossbar number (beginning with
zero), N is the number of input and output ports for an individual crossbar,
and N^k is the total number of input and output ports of the network as a
whole. Every transmitting switch sends exactly one value to every receiving
switch in the next stage.
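The inter-stage wiring rule can be exercised for the four by four case of Figure 4.3-3. The sketch below assumes that g is the number of the crossbar holding source port s (g = s div N) and that the network has N*N ports; the function names are illustrative only.

```python
def destination(s, n):
    """Destination port d for first stage output port s, with n by n crossbars.
    Assumes g is the crossbar that holds port s and the network has n*n ports."""
    g = s // n
    return (s * n + g) % (n * n)


def crossbar_links(n):
    """Count the leads from each transmitting crossbar to each receiving crossbar."""
    links = {}
    for s in range(n * n):
        pair = (s // n, destination(s, n) // n)
        links[pair] = links.get(pair, 0) + 1
    return links
```

With n = 4 every (transmitting, receiving) crossbar pair appears exactly once, which is the property stated in the last sentence above.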
4.3.1 Routing Network Control
The following two sections describe the techniques needed to con-
trol the two stage omega network and the three stage Clos network. No hard-
ware is in the design to support run time execution of these algorithms.
Figure 4.3-3 The Details of Inter-Stage Connections within the
Routing Network
(Labels shown: group number, output port number, input port number. Connections by group: d = s*4 mod 16; d = (s*4 + 1) mod 16; d = (s*4 + 2) mod 16; d = (s*4 + 3) mod 16.)
The crossbar implementation includes a memory to store sixteen four bit
routing control words for each data path (the 10145 of Figure 4.3.3.1-1).
A path from the data register to the memory input permits the control memories
to be loaded with values computed by the compiler or other software external
to the machine. As we will see in section 6.2, this capability is sufficient
to support the general circulation model and several other algorithms of prac-
tical interest.
4.3.1.1 Control of the Omega Network
The omega network in this design is composed of two stages of six-
teen by sixteen crossbars. Sixteen is the square root of 256, the total
number of input ports. The destination address for any data value which
enters the omega network from the first Clos network stage is an eight bit
number; the four high order bits are the number of the third Clos stage
crossbar to which the value must be sent. The low order four bits of that address give
the number of the output port of that crossbar to which the data value should
be sent. Lawrie (1973) and Wen (1975) have shown that the omega network can
perform all of the following useful data routings within suitable partitions:
1. Circular shifts in either direction of any amount.
2. Uniform separation of a group of contiguous values (unless p, the ultimate
separation distance, is relatively prime to the partition size, P, only
P divided by the greatest common divisor of p and P elements can be
"expanded").
3. Elements originally separated by uniform separations p can be brought to-
gether. Again, unless p and the partition size P are relatively prime,
elements separated by p units distance fail to wrap around, and only P
divided by the greatest common divisor of p and P elements can be
processed.
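The element count in items 2 and 3 can be illustrated numerically. The sketch below assumes a simple model, not taken from the cited work, in which element i of a partition of size P is sent to position (i*p) modulo P; the function name is hypothetical.

```python
from math import gcd


def reachable_positions(p, P):
    """Distinct positions reached when element i is sent to (i * p) mod P."""
    return sorted({(i * p) % P for i in range(P)})
```

Under this model the number of distinct positions equals P divided by gcd(p, P), so all P elements can be handled only when p and P are relatively prime.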
4.3.1.2 Shifts of One Position in a Clos Network
The argument of this section presents a description of the cases
illustrated in Figure 4.3.1.2-1. Three types of interactions of partitions
with the crossbar switches of the routing network are shown.
As the diagram shows, no more than one value needs to move up from
one switch in the first stage to another in the third stage, and no more than
one value needs to move down from one first stage switch to another third
stage switch. If we send all values which must move up to the top switch in
the second stage and all values which must move down to the last switch of
that stage, we are guaranteed that there will be no more than sixteen such
values, and moreover, that no two such values need to go to the same third
stage switch. Values in partitions like "A", "D" or "E" can be routed
straight through to the third stage, which can interchange them as required.
Only if there are partitions such as "P" or "E" will there be fewer than
sixteen values which must move up and down. One value from such partitions can
arbitrarily be sent to the top and bottom second stage switches to fill other-
wise unused positions.
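The counting argument above can be checked by brute force for any given set of partitions. The sketch below is an illustration only: the partition list format, the function name, and the convention that "up" means a higher numbered destination switch are all assumptions made here.

```python
def boundary_crossings(partitions, direction=1, switch=16):
    """Count values that leave each first stage switch when every partition
    (start, size) is shifted circularly by one position in `direction`.

    Returns two dicts keyed by source switch number: values bound for a
    higher numbered third stage switch and for a lower numbered one.
    """
    up, down = {}, {}
    for start, size in partitions:
        for port in range(start, start + size):
            dest = start + (port - start + direction) % size
            src_switch, dst_switch = port // switch, dest // switch
            if dst_switch > src_switch:
                up[src_switch] = up.get(src_switch, 0) + 1
            elif dst_switch < src_switch:
                down[src_switch] = down.get(src_switch, 0) + 1
    return up, down
```

For example, boundary_crossings([(0, 20), (20, 50), (70, 10), (80, 176)]) reports at most one value moving in each direction from any one switch, so a single second stage switch per direction suffices.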
This argument is difficult to extend to the case where shifts of
more than one position are involved, for then it is difficult to account
rigorously for all switch positions, and to ensure that no second stage
switch receives two or more values destined for the same third stage switch.
Figure 4.3.1.2-1 The Possible Interactions of Partitions with
Crossbar Switches
4.3.2 ECL Logic
The choice of ECL current mode non-saturating logic for the
implementation of the routing network was dictated by two factors: first, we
want to be able to route a set of operands through the network in a time
comparable to that of a processor operation, and second, we want to minimize
problems with noise and signal cross-talk in the many cables of the routing
network. The differential pairs of the ECL family, while necessitating
rigorous balancing of line impedances, give - in return - effective isolation
of the ground and signal levels of the driving and receiving logic. These two
advantages of ECL logic over TTL prompted the decision to design the routing
network with ECL logic.
The ECL logic packages used in this design are those in the series
developed by the Motorola Corporation and usually referred to as MECL 10000.
Many other manufacturers provide a second source for these circuits, and the
reference used for the data on 10000 series circuits used in this paper is
Signetics Corporation (1974A). In logic diagrams, ECL packages are labelled
with their part number, which is uniformly five digits beginning with one and
zero.
4.3.3 Routing Network Time and Component Count Estimates
The routing network can be built either as a pure switching system
through which values flow in one step, or it may be built with registers in
each stage so that successive values may flow through it in pipeline fashion.
A third option, not considered further here, is to build one stage of cross-
bars and cycle values through it twice for omega network operations and three
times for Clos network operations. In any case, crossbar switches for less
than the full forty bit width can be built and used in byte serial fashion.
Table 4.3.3-1 gives the details of a component count analysis for the
pipelined and non-pipelined designs for a sixteen by sixteen crossbar in terms of
the parameter B, the width in bits of the data path through the crossbar.
Table 4.3.3-2 presents the propagation time in nanoseconds through various
networks. Its values are derived by consideration of Figure 4.3.3-1 which
illustrates the hardware components through which a signal must flow in a
Clos network. (Also see section 4.3.3.1.) The total network switching time
and the component count for one crossbar are given in Table 4.3.3-3 for crossbars
of all reasonable byte sizes. The expected cycle time of memory for the
system is nominally 500 nanoseconds. Table 4.3.3-3 shows that to keep the time
for one routing step commensurate with this time, either a twenty bit
non-pipelined network, a pipelined Clos network for ten bit bytes, or a pipelined
omega network for eight bit bytes should be built. The component count aspect
of the issue makes it clear that the pipelined design is to be preferred.

                      Component Counts
             Pipelined Unit            Non-Pipelined Unit
Components   Per Bit   Per Crossbar    Per Bit   Per Crossbar
10101        -         4               -         4
10133        1/4       -               -         -
10145        -         16              -         16
10158        -         16              -         16
10164        2         -               2         -
Totals       16 * 2 1/4 * B + 36       16 * 2 * B + 36
Table 4.3.3-1 Crossbar Component Counts

            Clos Network                         Omega Network
      Pipelined          Non-Pipelined     Pipelined          Non-Pipelined
Total Time  Last Stage                Total Time  Last Stage
286         72           244          227         72           189
Table 4.3.3-2 Routing Network Propagation Times
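The entries of Table 4.3.3-3 appear to follow from the totals of Table 4.3.3-1 and the times of Table 4.3.3-2. The sketch below reproduces them under two assumptions inferred from the tabulated values rather than stated in the text: a pipelined transfer costs the total time plus one last stage time per additional byte, and a non-pipelined transfer costs the total time once per byte. The function names are illustrative.

```python
def crossbar_components(byte_bits, pipelined):
    """Package count for one sixteen by sixteen crossbar of width byte_bits."""
    per_bit = 2.25 if pipelined else 2.0   # packages per bit per output port
    return int(16 * per_bit * byte_bits + 36)


def routing_time_ns(byte_bits, total_ns, last_stage_ns=None, word_bits=40):
    """Time to move one word through the network in bytes of byte_bits.
    Pass last_stage_ns for a pipelined network, omit it for a non-pipelined one."""
    n_bytes = word_bits // byte_bits
    if last_stage_ns is None:
        return n_bytes * total_ns                      # assumed: one full pass per byte
    return total_ns + (n_bytes - 1) * last_stage_ns    # assumed: pipeline fill plus drain
```

For instance, routing_time_ns(10, 286, 72) gives 502 nanoseconds for the pipelined Clos network with ten bit bytes, close to the nominal 500 nanosecond memory cycle.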
The essential steps in the pipelined implementation are:
1. Transformation of the data from the parallel form of the processors to
the byte serial form for the routing network,
2. Transmission of the byte serial data through the routing network, and
3. Transformation of the byte serial data back to fully parallel form.
The following two sections discuss the transformation and transmission aspects
of the routing network hardware.
4.3.3.1 Data Transmission and Broadcasting
The data transmission logic is two or three stages of byte serial
sixteen input by sixteen output crossbar switches. The essential elements of
this network, the crossbar switches, are implemented by the logic of Figure
4.3.3.1-1, which shows the logic necessary to implement a one bit path.
Figure 4.3.3-1 Clos Routing Network Timing Estimates
(Packages shown in the signal path: SN74S195, 10124, 10164, 10133, 10125. Estimated totals: 286 nanoseconds pipelined, 244 nanoseconds non-pipelined.)
Pipel ined Non-Pipelined
Byte Size
Crossbar
JComponents j
Namoseconds Crossbar
Components
Namoseconds
i
Clos Omega Clos Omega
HO HT6 j 286 227 1316 2kk 189
20 756 | 358 299 676 k&Q 378
10 396
i
502 UU3 356 976 756
8 32U |
i
57^ 515 292 1220 9U0
5 216 !
i
790 731 196 196 1512
Table U.3.3-3 Component Counts and Network
Figure 4.3.3.1-2 Broadcasting with a Routing Network
The 10145 storage register shown in the figure stores the control bits for
all eight paths for one of the sixteen bytes through the crossbar. Three of
the four bits in a control signal select one of eight inputs as the output
of two 10164 eight-to-one selectors whose outputs are wire ORed together.
The fourth control bit, complemented by the 10101 inverter, serves to decide
which of the two selectors is enabled and which is disabled. The 10158
quadruple two-to-one selector permits either local or global control of the
switching path to be selected. The 10133 four bit latch holds the selected
result for the stage; these latches are the registers which permit pipelining
of the byte signals through the three stage network. Thus, each bit switched
through the crossbar requires two 10164 selectors, one quarter of a 10133
latch and a 10101 quadruple inverter, and one eighth of a 10158 selector and a
10145 register file.
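The sixteen-to-one selection performed for each bit can be modelled in a few lines. This is a behavioural sketch only; which selector is enabled by a zero in the fourth control bit is an assumption made here, and the function names are illustrative.

```python
def select_8_to_1(inputs, sel, enabled):
    """Model of one eight-to-one selector: a disabled selector contributes 0."""
    return inputs[sel] if enabled else 0


def crossbar_bit(inputs, control):
    """Pick one of sixteen input bits using a four bit control word."""
    low3 = control & 0b111          # selects one of eight inputs on both selectors
    high = (control >> 3) & 1       # decides which selector is enabled (assumed polarity)
    first = select_8_to_1(inputs[:8], low3, enabled=(high == 0))
    second = select_8_to_1(inputs[8:], low3, enabled=(high == 1))
    return first | second           # the two outputs are wire ORed together
```

Because exactly one selector is enabled at a time, the wire OR simply passes the enabled selector's output.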
A value from any of the 256 input ports of the routing network can
be broadcast to all 256 output ports using only two stages of crossbars. The
process is illustrated in Figure 4.3.3.1-2 for a two stage network of two by
two crossbars. The low order part of the address of the desired broadcast
input determines the setting for all first stage crossbars, and the high
order part of that address determines the setting of all second stage cross-
bars.
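The broadcast rule can be simulated for a two stage network of n by n crossbars. The sketch assumes the inter-stage wiring d = (s*n + g) modulo n*n given earlier in section 4.3, with g taken to be the crossbar holding port s; the function and variable names are hypothetical.

```python
def broadcast(values, src, n):
    """Send values[src] to every output port of a two stage network
    built from n by n crossbars (n*n ports in all)."""
    ports = n * n
    low, high = src % n, src // n
    # Stage one: every crossbar routes its local input `low` to all of its outputs.
    stage1_out = [values[(s // n) * n + low] for s in range(ports)]
    # Inter-stage wiring (assumed): output port s feeds input port (s*n + s//n) mod n*n.
    stage2_in = [None] * ports
    for s in range(ports):
        stage2_in[(s * n + s // n) % ports] = stage1_out[s]
    # Stage two: every crossbar routes its local input `high` to all of its outputs.
    return [stage2_in[(d // n) * n + high] for d in range(ports)]
```

With n = 16 the call broadcast(values, src, 16) returns 256 copies of values[src] for any source port src.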
4.3.3.2 Data Parallel-to-Serial and Serial-to-Parallel Conversion
The hardware which performs parallel-to-serial and serial-to-parallel
conversions resides in the processors as the dis-assembly and re-assembly
logic of Figure h. 5.2. L7-2. This hardware is shown in successively more detail
in Figure 4.3.3.2-1 and Figure 4.3.3.2-2. Figure 4.3.3.2-1 shows a complete
forty bit dis-assembly and re-assembly register together with its associated
drivers and receivers. The SN74S195 four bit parallel in and parallel out
shift registers are TTL circuits which receive values from the operand
registers of the processor and transmit values to the fraction selector of
the processor. The 10124 differential drivers receive TTL signals from the
SN74S195 shift registers, convert them to standard ECL levels, and transmit
them in differential pair form to the ECL logic of the routing network. The
10125 differential receivers accept ECL differential signal pairs from the
routing logic and convert them to TTL levels.
Figure 4.3.3.2-1 The Parallel-to-Serial and Serial-to-Parallel
Conversion Logic
(Signals shown: TROUTE(36,5), TROUTE(31,5), TROUTE(26,5), TROUTE(21,5), TROUTE(16,5), TROUTE(11,5), TROUTE(6,5), TROUTE(1,5); FROUTE(36,5), FROUTE(31,5), FROUTE(26,5), FROUTE(21,5), FROUTE(16,5), FROUTE(11,5), FROUTE(6,5), FROUTE(1,5).)
Figure 4.3.3.2-2 The Details of a Block of the Parallel-to-
Serial and Serial-to-Parallel Conversion Logic
(Signals shown: CLOCK1, CLOCK2, TROUTE(1) through TROUTE(5), FROUTE(1) through FROUTE(5), from a 10125, to a 10124.)
The assembly-disassembly register hardware can be implemented with
fewer components for eight bit byte operation than for ten bit byte
operation. The next paragraph discusses an eight bit byte design. The eight
bit design requires sixteen SN74S195 registers whereas the ten bit design
requires twenty. Furthermore, the eight bit design uses only four ECL 10000
series components; the ten bit design uses six.
Figure 4.3.3.2-2 shows the details of one of the SN74S195 blocks
of Figure 4.3.3.2-1. Table 4.3.3.2-1 lists the eight steps which are used to
transmit a forty bit value through a Clos routing network in five eight bit
bytes. In step one, five consecutive bits from the operand registers of the
processor are loaded in parallel into the SN74S195's shown using CLOCK1 and
CLOCK2 in synchrony. The results of step one, taken from the serial output
pins of the eight SN74S195's (pin twelve), are available to the routing
network as byte one of the input.
Step two uses CLOCK1 and CLOCK2 in synchrony again to perform a
serial shift which makes the eight bits of byte two available to the routing
network; at the end of this step, no data remains in the upper SN74S195 of
each pair. Step three uses CLOCK1 alone to shift the third byte into output
position. At the end of step three, the first three data bytes are in the
registers of the routing network pipeline. On step four, CLOCK1 is used to
supply byte four to the network and CLOCK2 is used to receive the first byte
of the routed result from the network. Steps five through eight complete the
routing process. On step eight, CLOCK1 and CLOCK2 are used in synchrony to
accept the fifth and last byte of the routed result. Although the design presented
is used with forty bit parallel inputs, it is clear that the technique
described by Table 4.3.3.2-1, with the addition of one more step which uses
both clocks in synchrony, could be used to transmit data words of up to
forty-eight bits in six bytes of eight bits each. Because latches and not
master-slave flip-flops are suggested for use in the crossbar switches, clock signals
controlling the flow of data through the network and logic of this section
would probably have to be applied in time starting with CLOCK2 (and for step
eight, CLOCK1 and CLOCK2) of Figure 4.3.3.2-2 and proceeding in succession
from right to left through the three stages of the routing network of
Figure 4.3-2. In particular, CLOCK2 could never be used to both shift a bit
out for output use and in for input use at the same time.

Cycle  CLOCK1  CLOCK2  Input        Output      Comments
1      1       1       forty bits   byte one    Parallel load from operand registers
2      1       1       none         byte two    Serial shift
3      1               none         byte three  Serial shift
4      1       1       byte one     byte four   Serial shift
5      1       1       byte two     byte five   Serial shift
6              1       byte three   none        Serial shift
7              1       byte four    none        Serial shift
8      1       1       byte five    none        Serial shift
Table 4.3.3.2-1 The Steps in Data Transmission Through a Clos
Routing Network

Cycle  CLOCK1  CLOCK2  Input        Output      Comments
1                      forty bits   byte one    Parallel load from operand registers
2                      none         byte two    Serial shift
3                      byte one     byte three  Serial shift
4                      byte two     byte four   Serial shift
5                      byte three   byte five   Serial shift
6                      byte four    none        Serial shift
7                      byte five    none        Serial shift
Table 4.3.3.2-2 The Steps in Data Transmission Through an Omega
Network
The seven steps in the data transmission process for a two stage
omega network are given in Table 4.3.3.2-2. Because the two stages only hold
two data bytes in the pipeline, there is no spare step, similar to that in
the Clos process, so that the capacity of the network is limited to forty
bits in five eight bit bytes if the logic of Figure 4.3.3.2-1 is used for the
parallel-to-serial and serial-to-parallel conversion process.
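The byte timing of the two transmission tables follows a simple pattern: byte i leaves the processor on cycle i and returns on cycle i plus the number of pipeline stages. The sketch below generates that timing; it models only which byte is sent or received on each cycle, not the clock signals or the register capacity limit, and the function name is an illustrative assumption.

```python
def transmission_schedule(n_bytes, stages):
    """List of (cycle, received_byte, sent_byte) for a word of n_bytes bytes
    passed through a pipelined network with `stages` register stages.
    None marks a cycle on which nothing is received or sent."""
    steps = []
    for cycle in range(1, n_bytes + stages + 1):
        sent = cycle if cycle <= n_bytes else None
        received = cycle - stages if cycle > stages else None
        steps.append((cycle, received, sent))
    return steps
```

Five bytes through three stages take eight cycles and through two stages take seven, matching the two tables; six bytes through three stages take nine, the one additional step mentioned for forty-eight bit words.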
4.3.4 Table Look Up
A table look up facility is provided within the routing hardware
to support the table look up needs of the model, primarily those of the long
wave radiation calculations. The table look up unit is shown in Figure 4.3.4-1.
One table look up unit is included for each of the sixteen routing units. The
hardware includes one processor memory module, an assembly dis-assembly
register like that of Figure 4.3.3.2-1, four SN74LS193 low power Schottky four
bit counters which form an address register, and four SN74157 quadruple
two-to-one selectors to determine the source of the memory address. The assembly
register receives data from port one of its corresponding first stage cross-
bar. The dis-assembly register delivers data to input port one of its
corresponding last stage crossbar.
The unit operates in two different modes. In the first mode, each
processor computes the address of the table value which it wants, using integer
arithmetic and the index adder discussed in section 4.2.5.1.11. The address
for the table entry for processor zero of each first stage routing crossbar
is clocked into the assembly register in two cycles. The data is read from
memory, dis-assembled and sent via the last stage crossbar back to processor
zero. The two address bytes from processor one could be clocked into the
assembly register as the last two bytes of data are clocked out to register
zero. This process continues until all sixteen words requested by the pro-
cessors have been delivered.
The second mode of table look up operation is table loading. In
this mode, an initial table address is sent from an appropriate source. In
some cases, the address may be broadcast from the control unit; in other cases,
an address unique to each table look up memory may be used: it is not
necessary that all look up tables have the same contents. The set of processors
can be partitioned by using the routing network to execute several programs
with different table contents simultaneously. The initial block address is
clocked into the register composed of the four SN74LS193 up-down counters. A
succession of table words from an appropriate source is then sent; between words
the storage address is incremented or decremented by one as appropriate.
At this point, a further remark about the logic of Figure 4.3.3.2-1
is in order. If the bit assignments shown in the figure were strictly adhered
to, the eight bit bytes transmitted by the routing network would not correspond
to contiguous eight bit segments of processor operands. In particular, if the
processor is to be able to compute a table address and transmit it in two byte
transmissions to the table look up unit, an input bit order different from that shown
in Figure 4.3.3.2-1 is required. Of course, the arrangement of the output bit
assignments can be reordered so that values are transmitted correctly by the
routing network. Suffice it to say that the input arrangement is arbitrary,
and that an arrangement which supports the needs of efficient use of the table
look up unit can be used without harming the other operational needs of the
routing system.
4.3.5 Communication with the Control Unit and the Input-Output Channel
The routing unit forms the basis for intercommunication among the
elements of the machine as well as with the input-output channel and any pos-
sible future secondary storage. The main function of the routing unit, that
of providing communication paths between the processors, has been discussed
in previous sections. The following two sections discuss the use of the
routing unit in support of data flow between the control unit and the
processors, and also in support of data flow between the machine and the peripheral
world envisioned for this design.
4.3.5.1 Communication Between the Array and the Control Unit
As we saw in section 4.3.3.1, two stages of the routing network
permit a value to be broadcast from any one input port to all output ports.
The control unit can, therefore, send a value to all processors if it can
transmit that value to any one of the input ports of the first routing unit
stage. It can receive a value from any of the processors by accepting a
value from any of the second stage output ports if that value has been broad-
cast to all of those ports by the first two stages of the routing network.
4.3.5.2 The Routing Unit in Support of Input and Output
Data transmission to and from a sequential external device on the
input-output channel can be supported by using the 256 eight bit registers of
stage one of the routing network as a large circular shift register. Informa-
tion to the control unit would enter any stage one input port and be broadcast
to the output port for the control unit in stage two. Information from the
control unit to the channel would flow through the control unit's input port
and be broadcast to an output port which is connected to the channel.
For volume data input from a sequential device, successive bytes
can be sent in through any stage two input port, broadcast to the third stage,
and clocked into the appropriate processor assembly register for subsequent
storage in array memory. Volume data output to a sequential device can be
broadcast from the first stage input ports in any desired order to all second
stage output ports. Any one of these can be connected to the channel.
Paths from a parallel access secondary storage device - not proposed
for the general circulation model - could be attached to consecutive input
ports of one stage shifted uniformly to the desired position in the next stage.
Although 256 parallel paths are conceptually simpler to deal with, any number
less than that can be accommodated by the joint use of mode and routing control.
Paths to a parallel access secondary storage device could be attached to the
second or third stage output ports, and blocks of data could be shifted to
those ports from either processor or control unit memory.
4.4 The Control Unit
The control unit must provide control signals to operate the three
other main components of the design: the processors in the array, the rout-
ing unit, and the input-output channel interface. As we have seen in section
4.3.4, the bulk of the load for input-output control is the task of the
routing unit control logic.
4.4.1 Control of the Processor Array
By design, the processors are simple to control. For each step, a
set of control signals and one clock pulse are all that is required. The
obvious control mechanism is a read only memory in which the proper control
signal sequences are stored, together with simple hardware to interpret the
instruction stream and send the appropriate sequence of control signals to the array.
The control unit can sample the status of any processor by examining
its mode, condition and status register contents by way of the routing network.
Figure 4.4.1-1 illustrates the three ways in which the control unit can access
the 256 MODEOUT signals from the mode logic of the 256 processors in the array.
An array of sixteen processors is shown in the figure, arranged in four groups
of four. In the design, the 256 processors would be arranged in sixteen groups
of sixteen; each four bit group of Figure 4.4.1-1 thus corresponds to a
sixteen bit group in the system. The control unit can access the logical OR of
all 256 MODEOUT bits as shown in Figure 4.4.1-1(a). It can access a sixteen
bit value whose bits represent the logical OR of the MODEOUT bits of the
processors in a sixteen bit group in either of two ways. In part (b), sixteen
contiguous MODEOUT logic bits are ORed to form one bit. In part (c), the
sixteen bits from corresponding positions in each of the sixteen groups of
contiguous processors are ORed.
Figure 4.4.1-1 Reception by the Control Unit of the MODEOUT
Signals
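The three MODEOUT reductions can be expressed directly on a list of 256 bits. This is a behavioural sketch only; the function names are illustrative assumptions, and the hardware would of course form these ORs with gates rather than loops.

```python
def or_all(bits):
    """Part (a): the logical OR of every MODEOUT bit."""
    return int(any(bits))


def or_by_group(bits, group=16):
    """Part (b): one bit per group of `group` contiguous processors."""
    return [int(any(bits[g * group:(g + 1) * group]))
            for g in range(len(bits) // group)]


def or_by_position(bits, group=16):
    """Part (c): one bit per position, ORed across all groups."""
    return [int(any(bits[pos::group])) for pos in range(group)]
```

If only processor 37 raises its MODEOUT bit, part (b) reports group 2 and part (c) reports position 5, so the two sixteen bit values together locate the processor.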
Figure 4.4.1-2 illustrates the three ways the control unit can
supply the MODEIN bit to the mode logic of the 256 processors. All 256 MODEIN
signals can be the same, as shown in Figure 4.4.1-2(a). Sets of sixteen
processors can be supplied with a common MODEIN bit value in the two ways
illustrated by parts (b) and (c) of Figure 4.4.1-2. In all cases, of course,
the MODEIN value can be combined with local control information stored in the
mode register and status register of each processor.
4.4.2 Control of the Routing Network
Control of the routing network - as section 4.3 makes clear -
requires sequences of synchronized and phased clock pulses interspersed with
shift control and selection signals. Although the precise nature of the con-
trol signals differs in kind from those for the array of processors, the same
technique can be used for the routing network as was used for the processor
array. The question as to whether two asynchronous control devices, one for
the processors, the other for the routing network, would prove cost effective
was not answered before work on the design ceased.
Figure 4.4.1-2 Transmission to the Processor Array of the MODEIN
Signal
5. Design Testing
The multiplier design was tested by constructing a hardware proto-
type, and the floating point addition logic was tested by simulation. The
following two sections discuss these two efforts.
5.1 The Logic Simulation System
Breuer has edited a book on simulation of computer systems, and one
of its chapters (Breuer, 1972) discusses logic simulators. Two classes of
simulation techniques are identified: the compiled code model and the table
driven model. In these terms, the logic simulator described here is a com-
piled code simulator.
In the bibliography for the logic simulation chapter, there are
references to many papers about logic simulation. The great majority of both
the references and the chapter deal with gate level simulation. The
simulator of this paper is a package level simulator. The references uniformly
discuss how their authors constructed simulators; no off-the-shelf simulation
system suitable for package level simulation exists that does not require the
user to write his own package simulation routines. This view was confirmed
by conversation with Dietmeyer (1975). Since the bulk of the work in con-
structing the simulator presented here was exactly that of writing the package
simulation routines, the author feels that no duplication of available material
is represented by the simulator construction effort described here.
Figure 5.1-1 is a diagram of the logic simulation system. The
primary input to the system is a description of logic to be simulated. A
preprocessor accepts this description and produces two items:
1. An assembly language program, consisting entirely of macro calls, which
simulates the input logic, and
2. A macro and a macro call which define the structure of a driving module
for the input logic.
Figure 5.1-1 Diagram of the Logic Simulation System
(Blocks shown: preprocessor; IBM 360 assembler, twice; IBM 360 linkage editor; logic output signals and timing information.)
Except for a few lines, the macro calls in output (1) above
correspond one-to-one with packages in the logic. Each logic function is
represented by a macro which, when assembled, simulates the action of the package.
Some of these macros expand into executable code directly, while others expand
into subroutine calls on simulation modules which reside in a package library.
The macros, not the preprocessor, determine whether a compiled code or table
driven simulator results from the approach described here. Note also that
the complexity of the packages simulated can vary from simple AND, OR level
gates to single packages which perform a full fraction multiplication.
Although the set of macros chosen for the particular simulator described here does
not permit it, a package could well be a simulation module produced by the
system for a part of the subject logic, so that modular investigation and
debugging of a design can be supported by the technique described here.
Output (2) above consists of a macro called STEP, written by the
preprocessor, which is called by the user of the package. A STEP call results
in one execution of the subject logic with the values for the input variables
given in the call. The only other output included in (2) is a call on the
macro BEGIN with all of the input and output signals for the subject logic as
parameters. Execution of this call begins each execution cycle by setting the
time portion for each input signal to the maximum of the times from the out-
put signals of the previous cycle. Assembly of output (2) together with a
handwritten series of STEP calls produces a module which exercises the sub-
ject logic.
By saving the logic object module and the input and output structure
description shown in Figure 5.1-1, the user of the simulation system can
execute the subject logic as many times as desired, having assembled it only
once.
5.1.1 The Logic Simulator Language and the Preprocessor
Tessler (1968) has defined a single assignment language as one with
the following properties:
1. Every statement is an assignment statement.
2. No two statements assign a value to the same variable.
3. No loops occur which cause the value of a variable to depend on itself.
With the relaxations of the third restriction described in later sections,
this language form is ideal for describing computer logic. The proper order
for execution of the assignment statements depends on the partial order
implicit in them: variables which never are assigned values are input signals
to the logic; variables which are only assigned values and never referenced
are output signals from the logic. All other variables are internal signals.
The first executable statement uses only input signals on its right side,
and defines an internal variable or output signal. The process of selecting
executable statements continues until all statements have been selected or a
loop occurs.
The preprocessor accepts a set of assignment statements which de-
scribe the logic. These statements can be in any order. The topological
sorting algorithm given by Knuth (1968, pp. 258-263) is used to output the
lines in a correct order for execution. Loops and multiple definition of
variables are detected.
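The ordering step can be sketched with a standard topological sort. The version below uses a queue-based method rather than reproducing the algorithm cited from Knuth, and it represents each statement as a tuple of output signals, function name, and input signals; that representation, the function name, and the example package names are assumptions made for illustration.

```python
from collections import deque


def order_statements(statements):
    """Return the statements in an order in which each can be executed.
    statements: list of (outputs, function, inputs) tuples.
    Raises ValueError on a multiply defined signal or on a loop."""
    producer = {}
    for idx, (outputs, _, _) in enumerate(statements):
        for sig in outputs:
            if sig in producer:
                raise ValueError(f"signal {sig} defined more than once")
            producer[sig] = idx
    pending = [0] * len(statements)          # unmet internal inputs per statement
    users = {i: [] for i in range(len(statements))}
    for idx, (_, _, inputs) in enumerate(statements):
        for sig in set(inputs):
            if sig in producer:              # signals never assigned are logic inputs
                pending[idx] += 1
                users[producer[sig]].append(idx)
    ready = deque(i for i, n in enumerate(pending) if n == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for j in users[i]:
            pending[j] -= 1
            if pending[j] == 0:
                ready.append(j)
    if len(order) != len(statements):
        raise ValueError("loop detected among the statements")
    return [statements[i] for i in order]
```

Statements may be supplied in any order; a statement whose inputs are all logic inputs is selected first, as described above.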
A line in the input language is an assignment statement which de-
scribes the action of one element (or package) of the logic. An input line
includes the signals which are outputs of the package, the function of the
package, and the signals which are the inputs to the package. Each line be-
gins with a list of the output signals from the package; this list is followed
by a colon. The function name follows the colon and is followed in turn by
a list of the input signals to the package. The line ends with a semicolon.
Signal names must be given to all signals which flow between
packages; each bit of a given named signal maps one-to-one into a wire in the
physical realization of the logic. A signal name is an identifier which
begins with a capital letter and is followed by seven or fewer capital letters
or digits. (The signal name convention of the logic language was also used
in section 4 for the hardware description. The eight character limit is
imposed by the use of the IBM 360 assembler which puts an eight character
limit on the symbol names which it accepts.) The identifier part of the
signal can optionally be followed by a bit specification. A bit specification
is one, two or three integers enclosed in parentheses and separated by com-
mas, and is required when the named signal consists of more than one bit.
The bits of an N bit signal are numbered from one for the most significant to
N for the least significant bit. A bit specification with a single integer
specifies that bit of the signal which has that integer as its bit number.
In a bit specification with two integers, the first specifies the bit number
of the most significant bit of the signal and the second specifies the number
of contiguous bits in the signal. The third integer of a three integer bit
specification gives the difference between successive bit numbers for the bits
in the signal when that difference is not one. Table 5.1.1-1 summarizes the
signal naming conventions.
Signal Name     Meaning
A               The one bit signal "A"
B(3)            Bit three of the multi-bit signal "B"
B(1,32)         Bits one through thirty-two of the multi-bit signal "B"
B(5,4)          Bits five through eight of the multi-bit signal "B"
C(1,2,4)        Bits one and five of the multi-bit signal "C"
Table 5.1.1-1 Summary of the Signal Name Conventions
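The bit specification rules can be checked with a short Python sketch. The function names `expand_bit_spec` and `parse_signal` and the regular expression are assumptions for illustration, not part of the preprocessor. A name without a bit specification is treated here as a one bit signal whose only bit is bit one, which is also an assumption.

```python
import re

# A capital letter followed by up to seven capital letters or digits,
# optionally followed by one, two or three integers in parentheses.
SIGNAL = re.compile(r"([A-Z][A-Z0-9]{0,7})(?:\((\d+(?:,\d+){0,2})\))?")


def expand_bit_spec(spec):
    """Return the bit numbers selected by (bit,), (first, count) or
    (first, count, step)."""
    if len(spec) == 1:
        return [spec[0]]
    if len(spec) == 2:
        first, count = spec
        step = 1
    elif len(spec) == 3:
        first, count, step = spec
    else:
        raise ValueError("a bit specification has one, two or three integers")
    return [first + k * step for k in range(count)]


def parse_signal(text):
    """Split a signal reference into its name and its list of bit numbers."""
    match = SIGNAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid signal reference: {text}")
    name, spec = match.groups()
    if spec is None:
        return name, [1]  # assumed: a one bit signal
    return name, expand_bit_spec(tuple(int(part) for part in spec.split(",")))
```

Applied to the last two rows of Table 5.1.1-1, `parse_signal("B(5,4)")` selects bits five through eight and `parse_signal("C(1,2,4)")` selects bits one and five.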
The individual bits of the signals are the variables assigned by execution
of the lines. The preprocessor guarantees that no bit is assigned a value
more than once, and that every bit which is referenced has been assigned a
value.
Many packages, such as the SN74S157 two-to-one selector, have one
output signal. Others, such as the SN74S182 look ahead carry generator,
have as many as five output signals. Every line which uses the same package
type should have the same number of input and output signals. The preproces-
sor prints a function usage summary for each package type which lists any
deviations in usage.
Frequently in the logic design described in section 4, there was a need
for constant logic one or zero signals. The logic description language
includes the variables ZERO, ZEROS, ONE and ONES as built in variables
with the constant logic values which their names suggest. It also happens
that some of the output signals from a package with multiple outputs are
not used. Since the preprocessor questions (but does permit) the use of a
package with different numbers of output signals in different instances,
the built in variable UNUSED is permitted; its use is encouraged for the
sake of clarity.
The preprocessor also includes two built in functions. The OUTPUT
function prints the values of the input signals written for it at the
first time that all of those signal values are set in a logic simulation
cycle; it appears in the place assigned to it by the partial ordering
process. An OUTPUT statement names no output variables, so that it begins
with a colon. The FORM statement is used to build multi-bit signals from
shorter signals. One instance of its use is to build an eight bit signal
composed of ZERO and ONE bits for input to the SN74S151 eight-to-one
selector which supplies the EXO overflow indication signal described in
section 4.2.5.1.12.4.
5.1.2 Timing by the Simulator
At run time, each named signal which occurs in the logic specification
is represented by the structure shown in Figure 5.1.2-1. The signal name
is left justified in a blank filled eight byte field. The name is followed
by a half-word integer which is used to store the time at which the signal
received its value. The time for multi-bit signals which are set by the
output from several different packages is the maximum of the times for all
such package outputs. When knowledge of such time differences is
important, multi-bit signals can be split into several different parts for
more detailed timing information. The bits of a named signal are each
represented by a byte; the string of bytes which represents the bits of
the signal follows the time half-word. The execution of an OUTPUT function
prints the signal name, the bit specification numbers, the signal time,
and the values of the specified bits.
[Figure: fields SIGNAL NAME, SIGNAL TIME, SIGNAL BITS; byte positions 8, 9, 10 and 11 marked]
Figure 5.1.2-1 The Format of the Representation of a Signal During
Simulation
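The record layout described above can be sketched with Python's `struct` module. This is an illustration under stated assumptions, not the simulator's actual storage code: ASCII is used in place of the character code of the 360, the half-word is treated as an unsigned big-endian integer, and bit values are stored as the byte values 0 and 1. The helper names are hypothetical.

```python
import struct

HEADER = ">8sH"  # eight byte name field, then a two byte (half-word) time


def pack_signal(name, time, bits):
    """Build a record: blank filled name, time half-word, one byte per bit."""
    header = struct.pack(HEADER, name.ljust(8).encode("ascii"), time)
    return header + bytes(bits)


def unpack_signal(record):
    """Recover (name, time, bits) from a record built by pack_signal."""
    raw_name, time = struct.unpack_from(HEADER, record, 0)
    return raw_name.decode("ascii").rstrip(), time, list(record[10:])


def merged_time(times):
    """Time of a multi-bit signal set by several packages: the maximum."""
    return max(times)
```

With this layout the name occupies bytes 0 through 7, the time occupies bytes 8 and 9, and the bit bytes begin at byte 10.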
Each package that receives a clock pulse sets the time of that pulse.
In this way, the first possible time at which the clock pulse could occur is
determined.
The following discussion describes the calculation for the value
assigned to the time for the output signal of an SN74S157 two-to-one selector.
The discussion will clarify the nature of the output signal time calculations.
As shown in Figure 5.1.2-2, the SN74S157 has four input signals and one
output signal. If the strobe signal is a logic one, the output signal is always
zero regardless of what the selection and A and B input signal values are.
In this case, the time assigned to the output signal is that for the strobe
signal plus the delay time through the package for this case given by Texas
Instrument Corporation (1973). When the strobe signal is a logic zero, the
value of the selection signal determines whether the package output is "A" or
"B". In this case, the time assigned to the output signal is the maximum of
the selection signal time plus its delay and the time of the selected input
signal plus its delay. The time of the non-selected input signal is ignored.
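The timing rule for the two-to-one selector can be written out as a small Python sketch. The function name, the representation of a signal as a (value, time) pair, and the delay figures used in the usage note are assumptions for illustration; the delays are made-up numbers, not data sheet values. A selection value of zero is taken to choose input A.

```python
def selector_output(strobe, select, a, b, delays):
    """Return the (value, time) pair for the selector output.

    strobe, select, a and b are (value, time) pairs; delays maps the keys
    "strobe", "select" and "data" to propagation delays in nanoseconds.
    """
    strobe_value, strobe_time = strobe
    if strobe_value == 1:
        # Output forced to zero; only the strobe path determines the time.
        return 0, strobe_time + delays["strobe"]
    select_value, select_time = select
    data_value, data_time = a if select_value == 0 else b
    # The time of the non-selected input is ignored.
    return data_value, max(select_time + delays["select"],
                           data_time + delays["data"])
```

For example, with hypothetical delays of 12, 15 and 8 nanoseconds for the strobe, selection and data paths, a selection signal set at time 10 and a selected input set at time 30 give an output time of max(25, 38) = 38, however late the other input arrives.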
5.1.3 Debugging Aids in the Simulation System
The simulation process for each package includes a test of each bit
of the input operand. Because each bit is represented by a byte of 360
memory, it can assume more than the two states found in conventional
digital logic. Input signals which are ignored by the package are not
tested; thus, the simulation of an SN74S157 selector does not test the
input and selection signals if the strobe signal value is a logic one. It
always tests the strobe bit value.
[Figure: SN74S157 package with INPUT A, INPUT B, SELECTION SIGNAL and STROBE SIGNAL inputs and one OUTPUT SIGNAL]
Figure 5.1.2-2 The SN74S157 Two-to-One Selector
During the early debugging of the simulator, this testing process
helped to identify the source of each error. The standard simulator response
to an improper bit value in a tested signal is to print an error message
together with the standard output for the errant signal (that is, its name,
bit specification, time and bit values). Logic ones and zeros print as ones and
zeros; improper bits print as dots. The simulator halts and dumps memory when
an error occurs. Although the investigation was not carried to this point, the
simulator could easily be altered, so that it would continue rather than
halting when an improper bit value is detected. This action would help in
designing fault detection programs for the logic, since it would permit easy
determination of the propagation effects of an error. Moreover, it would per-
mit identification and verification of those signals whose values, for a par-
ticular cycle, are of no consequence.
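The bit testing described in this section can be sketched in Python as follows. The marker value `UNSET`, the exception type, and the function names are hypothetical. The sketch also assumes that the non-selected operand counts as an ignored input and is therefore not tested; the text states this only for the case in which the strobe is a logic one.

```python
UNSET = 0xFF  # hypothetical byte value for a bit which was never assigned


class ImproperBit(Exception):
    """Hypothetical error raised when a tested signal holds an improper bit."""


def render_bits(bits):
    """Ones and zeros print as themselves; improper bits print as dots."""
    return "".join(str(b) if b in (0, 1) else "." for b in bits)


def check_signal(name, time, bits):
    """Raise ImproperBit, with the standard output, if any bit is improper."""
    if any(b not in (0, 1) for b in bits):
        raise ImproperBit(f"{name} time={time} bits={render_bits(bits)}")


def simulate_selector(strobe, select, a, b):
    """Each argument is (name, time, bits); strobe and select are one bit."""
    check_signal(*strobe)                 # the strobe bit is always tested
    if strobe[2][0] == 1:
        return [0] * len(a[2])            # select, A and B are not tested
    check_signal(*select)
    chosen = a if select[2][0] == 0 else b
    check_signal(*chosen)                 # assumed: the other operand is skipped
    return list(chosen[2])
```

Replacing the `raise` with a recorded warning would give the continuing behavior suggested in the text, since the improper byte values would then simply flow on to later packages.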
5.1.4 Simulated Packages with No Exact Hardware Analog
In the description of the left operand selection logic (section
4.2.5.1.5), the block in Figure 4.2.5.1.5-1 represented selection
functions rather than hardware packages. In many cases, simulation results
are not affected, but simulation time is reduced by permitting the
simulation macros to perform package functions in this approximate way.
Thus, the macro which simulates the SN74S157 two-to-one selector will
accept input operand pairs of any bit length from one to 256, and will
produce an output signal with the corresponding bit length. This deviation
from exact simulation does no violence to the logic function or the logic
execution time of the simulated logic.
5.1.5 Loops
In section 5.1, we referred to relaxations of the restriction on a
single assignment language which prohibits loops. In real hardware designs,
loops do occur. Three different types of loops are present in the simulated
floating point addition hardware, and they are discussed in the three sections
which follow.
5.1.5.1 Loops and Storage Registers
The value of the zero flip-flop from a previous cycle must be used
to determine the action of the normalization process (see section 4.2.5.2.1
and Figure 4.2.5.2.1-2). Another example (which was not simulated) occurs
in the cases of the overflow flip-flop of Figure 4.2.5.1.12.4-1 and the
underflow flip-flop of Figure 4.2.5.1.12.5-1. In both of these cases, the
previous value of the flip-flop occurs as a possible input to determine
its subsequent value. The loops which these cases give rise to should be
broken by delaying the execution of the line which assigns a new value to
the register or flip-flop until after all lines which reference the old
value have been executed. Preceding the output signal name with an
asterisk has precisely this effect: a line which contains an output symbol
preceded by an asterisk is placed in the output program after all lines
which refer to the named output signal.
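The effect of the asterisk on the ordering can be sketched in Python with the standard `graphlib` module (Python 3.9 or later). The statement representation and the signal names in the usage note are hypothetical. For an ordinary signal the writer must precede each reader; for an asterisk-marked signal the constraint is reversed, so that every reader of the old value precedes the update.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later


def order_with_registers(statements):
    """Order (outputs, inputs) statements; a leading '*' marks a register."""
    writer, starred = {}, set()
    for index, (outputs, _inputs) in enumerate(statements):
        for name in outputs:
            bare = name.lstrip("*")
            writer[bare] = index
            if name.startswith("*"):
                starred.add(bare)

    sorter = TopologicalSorter()
    for index, (_outputs, inputs) in enumerate(statements):
        sorter.add(index)
        for name in inputs:
            source = writer.get(name)
            if source is None:
                continue                       # an input signal to the logic
            if name in starred:
                if source != index:
                    sorter.add(source, index)  # reader runs before the update
            else:
                sorter.add(index, source)      # writer runs before the reader
    # static_order raises graphlib.CycleError if a loop remains.
    return [statements[i] for i in sorter.static_order()]
```

As a usage example with made-up names, the three lines `*ZERO: f NEWZ`, `NORM: g ZERO, FRACT` and `NEWZ: h NORM` are ordered NORM, NEWZ, *ZERO. Without the asterisk the same three lines form a loop and the sort fails.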
5.1.5.2 Apparent but not Real Loops
The logic of the index adder, shown here again as Figure 5.1.5.2-1,
appears to include a loop. The SN74S182 receives the carry generation and
propagation signals, IXG(1,4) and IXP(1,4), from the four SN74S181
arithmetic-logic units, and returns the three carry signals, IXC4, IXC8,
and IXC12, to three of the SN74S181's. On closer examination, however, we
find that the functions of the SN74S181 can be partitioned into two
separate operations. The generate and propagate signals depend only on the
values of the inputs A(9,16) and CUADDR(1,16) and are independent of the
carry inputs IXCARRY, IXC8, and IXC12. The sum EADDR(1,16) depends on the
input operands and the carries. The apparent loop is broken in the
simulator by implementing the two separate functions of the SN74S181 (and
also the SN74S381) as two separate pseudo-packages as shown in Figure
5.1.5.2-2. The S181GP package uses the input operands A(9,16) and
CUADDR(1,16) to produce the generate and propagate signals for the S182.
The carries from the S182 package are used by the S181 package, together
with the input operand values, to produce the required sum.
[Figure: four SN74S181 units and one SN74S182. Inputs: CUADDR(13,4),
A(21,4), CUADDR(9,4), A(17,4), CUADDR(5,4), A(13,4), CUADDR(1,4), A(9,4),
IXMODE, IXCARRY, IXFUNC(1,4). Internal signals: IXG(1) through IXG(4),
IXP(1) through IXP(4), IXC4, IXC8, IXC12. Outputs: EADDR(13,4),
EADDR(9,4), EADDR(5,4), EADDR(1,4).]
Figure 5.1.5.2-1 The Index Adder Logic
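The partition into a generate and propagate step, a look ahead step, and a sum step can be illustrated with a Python sketch of a sixteen bit addition in four bit groups. It models plain binary addition only, not the full function set or signal polarities of the actual packages, and it numbers groups from the least significant end, unlike the signal numbering used in the text. The function names are assumptions.

```python
def group_gp(x, y):
    """Generate and propagate for one 4-bit group: operands only, no carry."""
    generate = int(x + y > 0xF)        # carry out even with carry in of zero
    propagate = int((x | y) == 0xF)    # a carry in would be passed through
    return generate, propagate


def lookahead_carries(gen, prop, carry_in):
    """Carry into each group, from group G and P values and the carry in.

    Written as a recurrence for brevity; a look ahead generator evaluates
    the same logic functions in flattened form.
    """
    carries = [carry_in]
    for g, p in zip(gen, prop):
        carries.append(g | (p & carries[-1]))
    return carries


def add16(a, b, carry_in=0):
    """Add two 16-bit values in three separate steps; return (sum, carry)."""
    groups = [((a >> s) & 0xF, (b >> s) & 0xF) for s in (0, 4, 8, 12)]
    gp = [group_gp(x, y) for x, y in groups]               # step 1: operands
    carries = lookahead_carries([g for g, _ in gp],
                                [p for _, p in gp], carry_in)  # step 2: carries
    total = 0
    for k, (x, y) in enumerate(groups):                    # step 3: sums
        total |= ((x + y + carries[k]) & 0xF) << (4 * k)
    return total, carries[4]
```

Because step 1 never reads a carry, the three steps can be executed strictly in sequence, which is the sense in which the loop is only apparent.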
Figures 5.1.5.2-3 through 5.1.5.2-8 are the computer output for the
simulation of the index adder. Figure 5.1.5.2-3 shows the SYSPRINT file
which lists the logic description which was input, and summarizes the
functions used in the logic and the signals which are inputs to the logic
and outputs from the logic. The first seventy-two characters of each input
line are processed by the logic simulator. Card input is assumed, and the
last eight columns of each card can be used for card sequence information.
The entire eighty columns of each input card are listed, and the function
summary lists the card number of the function card printed. If a function
is used with different numbers of input or output signals, all cards for
that function are printed in the function summary. This situation may or
may not represent an error, and the user can proceed to assemble and
execute a simulator with this sort of input. The response is completely
determined by his macros which define the package operations. Macros which
accept a variable number of inputs can be written and used where desired.
Figure 5.1.5.2-4 shows the assembly program written to simulate the index
adder in response to the input shown in Figure 5.1.5.2-3. Figure 5.1.5.2-5
shows the STEP macro written to facilitate control of the logic. Figure
5.1.5.2-6 shows a list of input STEPS which produced the simulator output
shown in Figures 5.1.5.2-7 and 5.1.5.2-8.
[Figure: pseudo-packages S181GP, S182 and S181. Inputs: CUADDR(13,4),
A(21,4), CUADDR(9,4), A(17,4), CUADDR(5,4), A(13,4), CUADDR(1,4),
IXCARRY, IXFUNC(1,4), IXMODE. Internal signals: IXG(1) through IXG(4),
IXP(1) through IXP(4), IXC4, IXC8, IXC12. Outputs: EADDR(13,4),
EADDR(9,4), EADDR(5,4), EADDR(1,4).]
Figure 5.1.5.2-2 The Apparent Loops in the Index Adder Logic Removed
[Figures 5.1.5.2-3 through 5.1.5.2-8: computer listings for the index
adder simulation (SYSPRINT file, assembly program, STEP macro, input
STEPS, simulator output)]
In the appendix, the complete control card and input set up for the
simulation of the floating point addition and subtraction for the processor
is given. As shown in the listing, the OUTPUT built in function will accept
an integer value in the output field. This value can be used together with
an integer PARM to suppress output. Only output with an output number less
than the PARM number is printed during simulation.
5.1.5.3 Sequential Logic: Real Loops
An alternative design for the exponent adder, shown in Figure 5.1.5.3-1,
includes the feedback characteristic of sequential logic. This design was
not used as the eventual exponent adder described in section 4.2.5.1.3
because it is significantly slower than the adder described in that
section, and the exponent adder stands directly in the center of a
time-critical path in the logic. This slower form performs a one's
complement subtraction; feedback of the high order carry is required to
compute a correct result. The absolute value of the difference is produced
by SN74S86 exclusive OR gates which complement the one's complement result
when it is negative and pass it through in true form when the difference
is positive or zero. The logic was correctly simulated; the technique used
is shown in Figure 5.1.5.3-2. In this particular case, even when the
so-called end around carry of the one's complement is a one, addition of
that carry to the difference will not alter the carry. The one's
complement negative zero is complemented by the SN74S86 exclusive OR
gates. Hence, one pass around the loop always produces the correct result.
The simulation unwinds the loop and expresses it as shown in the figure.
[Figure: two SN74S181 units and SN74S86 gates. Inputs: AEXP(1,4),
BEXP(1,4), AEXP(5,4), BEXP(5,4), EXFUNC(1,4). Carry: EXC2. Outputs:
ABS(1,3), ABS(4,4).]
Figure 5.1.5.3-1 The Sequential Circuit for the Exponent Adder
[Figure: four SN74S181 units and SN74S86 gates. Inputs: AEXP(1,4),
BEXP(1,4), AEXP(5,4), BEXP(5,4). Carries: EXCARRY, EXC2. Outputs:
ABS(1,3), ABS(4,4).]
Figure 5.1.5.3-2 The Sequential Circuit for the Exponent Adder
as it was Unwound for Simulation
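The unwound computation can be checked with a short Python sketch. It assumes an operand width of eight bits and assumes that the sign of the result is taken from the end around carry; the function name is hypothetical. The first pass produces the carry, the second pass repeats the addition with that carry fed in, and the exclusive OR step complements a negative result.

```python
def exponent_difference(a, b, width=8):
    """Absolute difference of two unsigned values by one's complement
    subtraction, with the end around carry loop unwound into two passes."""
    mask = (1 << width) - 1
    not_b = ~b & mask                   # one's complement of b
    first = a + not_b                   # first pass, carry in of zero
    carry = first >> width              # the end around carry
    second = a + not_b + carry          # second pass, carry fed back
    assert second >> width == carry     # feeding the carry back does not alter it
    result = second & mask
    if carry == 0:                      # negative result, or negative zero
        result ^= mask                  # complemented, as by exclusive OR gates
    return result
```

For equal operands the second pass yields all ones (negative zero) with no carry, and the final complement turns it into zero, so a single pass around the loop suffices in every case.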
5.1.6 Wiring Lists
An original goal of the logic simulation system was the production
of wiring lists from the logic description for the debugged logic. Work
toward this goal was not performed, and the techniques used to avoid loops
described in sections 5.1.5.2 and 5.1.5.3 make the production of wire
lists more difficult. The use of packages for arbitrary length operands,
described in section 5.1.4, adds to the problem of wire list production.
The technique of section 5.1.4 is a convenience used to reduce the length
of the logic description and speed up the simulation execution. The loop
avoidance techniques, on the other hand, are necessary deviations from an
exact line to package one-to-one correspondence. Another obstacle in the
way of wire list production is the use of implicit input signals, such as
constant logic one inputs to AND gates which, in physical form, have more
inputs than the particular use requires. In the simulation of the floating
point addition and subtraction hardware which was performed, several
packages which have strobe input signals like that of the SN74S157 were
simulated without providing for this input. The assumption implicit in
this practice is that the missing strobe signal is always to be connected,
in the actual hardware, to a logic zero.
All of the cases which appear to cause trouble can be treated in a
simple way except the sequential circuit case. Implicit input signals and
non-standard signal lengths can be easily accounted for. The correct
association of the S181GP and S181 pseudo-packages can easily be made on
the basis of the common signals which both share. In the sequential
circuit case, however, different signal names are required by the very
nature of the feedback situation to break the loop brought on by that
feedback situation. The author sought but was unable to find a technique
like that of the asterisk notation for register values for such signals.
5.2 The Multiplier Prototype
The great bulk of the multiplier design described here was done by
William Stenzel and will be described in detail in his master's thesis
(Stenzel, 1975).
The facilities of the Computer Science Department shop limited us
to two-sided boards with maximum dimensions of fifteen inches by eighteen
inches. In practice, these are not confining limits, since we had decided to
use two-sided boards throughout the design, and a fifteen by eighteen inch
board is about as large as one can practically use. The multiplier logic
contains ninety integrated circuits which require a complicated data
interconnection pattern. With the help of the etched power and ground buss
structure suggested by Mr. Frank Serio, we were able to design and build a
one board multiplier prototype. Power and ground distribution, often the
third and fourth layers of a multi-layer board, were provided by etched
distribution systems. A diagram of the scheme is shown in Figure 5.2-1, and
Figures 5.2-2 and 5.2-3 show the artwork for the power and ground systems,
respectively. The thin strips of the buss systems run between the rows of
pins of the dual-in-line circuit packages of the logic. Pins at the
appropriate points connect the integrated circuits to the power and ground
distribution system. The etched circuits of the system are insulated and
attached to the board by insulating tape.
[Figure: labels DUAL IN-LINE PACKAGE, CONNECTION ETCH ON THE BOARD,
PRINTED CIRCUIT BOARD, INSULATING TAPE]
Figure 5.2-1 Details of the Power and Ground Bussing System
Figure 5.2-2 Power Distribution Artwork
Figure 5.2-3 Ground Distribution System Artwork
After several iterations, Mr. Stenzel decided on a board layout
which places the integrated circuit components in a horseshoe arrangement at
the periphery of the board with the input lines running up the center of
the component side of the board and the output signals running down its
outside edges. The component and solder sides of the resulting board are
shown in Figure 5.2-4 through Figure 5.2-7.
The sum of the maximum operating times of the integrated circuits in
the multiplier logic is 264 nanoseconds, and the sum of the typical operating
times is 189 nanoseconds. Several stages of testing and refining the ground
transmission by the cabling have shown that the multiplier will operate
reliably at cycle times as low as 200 nanoseconds. The original cables which
provided the input to the board and received its output were twenty-six
conductor flexible ribbon-type cables. Twenty-four conductors of each of
four cables were used to transmit the twenty-four bits of each of the two
input operands and the forty-eight product bits. To obtain satisfactory time
and noise performance from these cables, we found it necessary to shield
each of them with copper tape ground planes. Therefore, we feel that the
eventual system should use nothing less than cabling which will transmit
interleaved ground and signal pairs between boards.
Figure 5.2-4 Multiplier Prototype Board, Component Side
Figure 5.2-5 Photograph of the Component Side of the Multiplier Board
Figure 5.2-6 Multiplier Prototype Board, Solder Side
Figure 5.2-7 Photograph of the Solder Side of the Multiplier Board
6. System Performance
This group of sections will evaluate several aspects of the performance
of the machine. In the first section, we will discuss the execution time
for operation cycles of the processors with information derived from the
logic simulation work. The other sections will evaluate the effectiveness of
the design for the weather model, matrix inversion, image data processing and
information retrieval.
6.1 Processor and Routing Unit Cycle Times
The simulator indicated that the time for a floating point addition
or subtraction was 256 nanoseconds. Two selector stages and the operand
registers, all of which are in the operation cycle for the complete processor,
were not included in the simulation. Inclusion of these elements would
increase the time measured by the simulator to 336 nanoseconds. This figure
represents the sum of the maximum propagation time through the logic elements.
As the experience with the multiplier has shown, it is not unreasonable to
expect this time to be achievable. On this basis, we estimate that a
reasonable operation cycle time for the processor logic is 350 nanoseconds.
The logic description of the processor given in section 4 did not include any
extra logic to reduce the cycle time for frequently occurring special cases.
Replacing the fraction selection logic of Figure 4.2.5.1.7-2 with that shown
in Figure 6.1-1 removes the adder and the left and right operand selection
gates from the path taken by normalization and multiplication results. The
simpler but slower design assumes the use of one constant clock frequency to
control the operation cycle of the processor. Adding extra paths implies the
need for different operation cycle times, so that more complicated clocking
logic would be required. The increased complication occurs only in the
control unit, however, not at the processor level. A complete analysis beyond
that permitted by the information we now have about the system is required to
decide how cost effective such enhancements would be.
[Figure: selector packages with SUM, FROUTE, PROD and STATUS(1,8) inputs.
Outputs: FRACT(25,8), FRACT(17,8), FRACT(12,5), FRACT(5,7), FRACT(4),
FRACT(2,2), FRACT(1).]
Figure 6.1-1 An Alternative to the Fraction Selection Logic
Few of the arithmetic operations which the model will actually use
can be performed in only one processing cycle. All normalized results require
at least two cycles. A normalized multiplication will probably require three
cycles unless a logic enhancement like that mentioned in the previous
paragraph is used. On the other hand, the compare, normalize, integerize and
all of the move operations will take only one cycle.
Work which was not completed was to have experimented with prototype
routing hardware. The results of this work would have provided a basis
for estimating the operation time of the routing network. The principal
unknown factor in this part of the design is the time required to send the
signals through the cables connecting the switches in the routing network. In
section 4.3.3, we estimated the times for the routing unit by assuming cable
transmission times of fifty nanoseconds. The estimate given there for the
operation time of a pipelined unit with eight bit paths was 5^2 nanoseconds.
This estimate will have to stand, since we have no information about the
actual behavior of a prototype for this logic.
6.2 Performance of the System on the General Circulation Model
There is no subroutine of the general circulation model which is
small enough to serve as a reasonable test case for timing estimates. The
only parts of the model for which 360/95 times are available are the large
C0MP1-C0MP2, C0MP3, and the radiation subroutines. The subroutines C0MP1 and
222
which form the core of the model, exist as two separate subroutines only be-
cause the logical unit which they form is too large for complication by the
IBM FORTRAN H compliler (Karn, 197*0- Evidence for the applicability of the
array computer architecture is found, however, in the results of the effort
by GISS to run their model on the ILLIAC IV (Karn, 1975), which are presented
in Table 6.2-1. The table shows the ratio of ILLIAC IV to 360/95 processing
times for three parts of the model. During the time these figures were mea-
sured, the extensive facilities of the ILLIAC IV control unit, which are in-
tended to speed instruction decoding and overlap the execution of parts of
array instructions, were disabled; this accounts for the relatively low ratio.
With all of the features of the control unit operational, these ratios should
all increase by a factor of three. The poor performance of ILLIAC IV on the
radiation routine is a direct result of the fact that the 3000 word table
which is used by this routine had to be distributed across all sixty-four
processing unit memories in the array. As a consequence,
table access by a processor to a particular table value was very time con-
suming. This very result prompted the inclusion of the table look up facili-
ties in the current design. The last line of the table gives the performance
figures for a new radiation algorithm designed for use on parallel machines.
It uses more computation and less table space, so that - on ILLIAC IV - the
required table can be stored within the memory of every processor.
Rather than attempting a timing exercise for the model on the de-
sign, we will present an analysis of the efficacy of the routing network in
supporting the data communication needs of the model. Figure 6.2-1 is a
schematic representation of the grid of the general circulation model. Each
Code Segment            360/95 Time    ILLIAC IV Time        Time Ratio
                        (seconds)      (seconds of CPU
                                       time only)
COMP1                     12.78           2.36               5.42 : 1
COMP3                      6.54           1.54               4.25 : 1
Radiation                 57.90         187.65               1 : 3.25
(Large Table)
Radiation                 *****          33.00               1.76 : 1
(Parallel algorithm)

Table 6.2-1 Relative Timing of the ILLIAC IV and 360/95 Models
[Figure not reproduced; labels in the figure: W, E, TROPOSPHERE, SURFACE]
Figure 6.2-1 A Schematic Representation for the Grid of the
General Circulation Model
spherical shell is shown as a rectangle. The north and south edges of each
rectangle represent the north and south poles at the various vertical levels.
Figure 6.2-2, based on Arakawa (1972), Tsan (1973) and Mintz (1974), shows the
types of interactions between points of the grid which occur in the model.
The interaction of the vertical levels is very simple. All of the horizontal
interactions require simple access to one neighboring value (or a sequence of
these operations) except the case which requires that the set of polar values
be averaged to produce one common value.
The horizontal averaging shown in the figure is required to overcome
the effect of the convergence of the meridians at the poles. The
Courant stability condition, c Δt < Δx (Fox, 1961), which relates the maximum
velocity to the inter-grid point spacing, would require a very small time
step over the entire grid for numerical stability. All models violate this
condition, and use a larger time step than the small polar inter-grid distances
permit. The resulting instabilities in the polar regions are removed by
averaging several meridional values; the number of averaging iterations increases
as the latitude approaches the polar regions. This zonal smoothing occurs
even in the split grid model, although to a lesser degree. Because of this
zonal smoothing, there is a clear inherent preference for parallel computation
on circles of constant latitude. This approach is the best way to maximize
the efficiency of the computation by maximizing the number of processors
actively contributing to the results at any time.
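As a rough illustration of why the convergence of the meridians matters, the following sketch computes the east-west grid spacing and the largest time step which the Courant condition would allow at several latitudes. The Earth radius, the signal speed of 300 m/s, and the function names are assumptions chosen for illustration; they are not values taken from the GISS model.

```python
import math

EARTH_RADIUS_M = 6.371e6   # assumed mean Earth radius, in metres
SIGNAL_SPEED = 300.0       # assumed maximum signal speed c, in m/s


def zonal_spacing(lat_deg, points_per_circle):
    """East-west distance in metres between neighbouring grid points."""
    circumference = (2.0 * math.pi * EARTH_RADIUS_M
                     * math.cos(math.radians(lat_deg)))
    return circumference / points_per_circle


def max_time_step(lat_deg, points_per_circle, c=SIGNAL_SPEED):
    """Largest time step (seconds) satisfying c * dt < dx at this latitude."""
    return zonal_spacing(lat_deg, points_per_circle) / c


# The permitted time step shrinks with the cosine of the latitude.
for lat in (0.0, 60.0, 85.0, 89.0):
    dx_km = zonal_spacing(lat, 128) / 1000.0
    print(f"lat {lat:5.1f}: dx = {dx_km:8.1f} km, dt < {max_time_step(lat, 128):7.1f} s")
```

With 128 points on every latitude circle, the spacing at sixty degrees is half that at the equator, and it continues to shrink toward the pole, which is why a single global time step chosen for the polar rows would be very small.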
For the next decade, GISS will be interested in models of two different
horizontal resolutions (Halem, 1974). Both models have fifteen vertical
levels. The two horizontal resolutions are:
[Figure not reproduced; panels: Horizontally (NS and EW interactions between point i,j and its neighbors, AVRX horizontal averaging, and the Pole Special Case, the sum of all values on the pole latitude "circle"); Vertically (interaction between adjacent levels)]
Figure 6.2-2 The Ranges of Interactions Between Points
in the Finite Difference Grid
1. a model with 128 points around its equator and ninety-six circles of latitude,
which we will call the 96x128 grid, and
2. a model with 256 points around its equator and 192 circles of latitude,
which we will call the 192x256 grid.
In the next two sections, we will discuss the two primary variations of the
model: the UCLA rectangular model and the GISS split grid model. A third
section will discuss the common problem of computing the average of all polar
values.
6.2.1 The Rectangular Model
In this model, all latitude circles have the same number of points.
The 192x256 grid fits the machine very well; the entire array is treated as
one circle of size 256. All of the processors are always fully employed. For
the 96x128 model, the array can be treated as two circles of size 128. Four-
teen of the fifteen vertical levels for a given latitude can be processed in
parallel in seven cycles. One level from each of two different latitude lines
can be processed in an eighth cycle, so that two complete latitude circles can
be processed in fifteen computation cycles. In high latitude regions, half
of the processors will be inactive during part of one of these cycles while
the other half complete the extra zonal averaging steps required at the higher
latitude. The machine will be very efficient for these models. Only shifts
of one position left or right are required for east-west communication. An
occasional shift of 128 positions is required for north-south communication
in the 96x128 grid. All of the required shifts can be accomplished by the
omega network in one routing network pass.
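The fifteen-cycle figure for the 96x128 grid can be checked with a small scheduling sketch. It assumes only what the paragraph states: the 256 processors act as two circles of 128, so each computation cycle handles two (latitude, level) pairs. The function name and the latitude labels "A" and "B" are hypothetical.

```python
def schedule_two_latitudes(levels=15, circles_per_cycle=2):
    """Return a list of cycles; each cycle is a list of (latitude, level) pairs.

    Two latitude circles, "A" and "B", each with `levels` vertical levels,
    are packed two at a time into the two 128 processor circles.
    """
    work = [(lat, lev) for lat in ("A", "B") for lev in range(1, levels + 1)]
    return [work[k:k + circles_per_cycle]
            for k in range(0, len(work), circles_per_cycle)]


cycles = schedule_two_latitudes()
for number, cycle in enumerate(cycles, start=1):
    print(number, cycle)
```

In this schedule the first seven cycles cover fourteen levels of latitude A, the eighth cycle pairs the last level of A with the first level of B, and the remaining seven cycles finish latitude B.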
6.2.2 The Split Grid Model
We will discuss two different techniques for the split grid model.
In the first of these, the rectangular grid with points deleted will be used,
and missing points will imply unused processors. Figure 6.2.2-1 shows one
rectangle of the resulting grid for the 96x128 model. To retain contiguity
of values on the same meridian, points are stored with increasing separation
between active processors as the latitude increases. Table 6.2.2-1 shows how
the number of split grid regions - regions with the same number of points on
a latitude circle - increases as the horizontal grid is refined. Table 6.2.2-2
shows a possible distribution of latitude circles of the various sizes which
occur in the 96x128 and 192x256 grids.
Meridians at      Number of Split
the Equator       Grid Regions
    72                  5
   128                  7
   256                 11
   512                 15

Table 6.2.2-1 The Number of Split Grid Regions for Various
Model Sizes

Just as in the rectangular model, the 192x256 grid uses the processor array
as one circle of size 256, and the 96x128 grid uses two circles of size 128.
In the rectangular model, a uniform shift of one position was always required
for east-west communication. Here, however, shifts of from one to as many
as thirty-two positions (for the eight point high latitude circles in the
192x256 grid) are required. North-south communication in the 96x128 grid requires
an occasional shift of 128 positions as before. All of the required shifts
are supported in one routing cycle by the omega network included in the
routing network.
[Figure not reproduced; the rectangle is 128 memories wide and 96 latitude circles high, with bands labeled as follows:
16 points/circle; one stored every eight memories
32 points/circle; one stored every four memories
64 points/circle; one stored every second memory
128 points/circle; a point is stored in every memory]
Figure 6.2.2-1 One of the Vertical Levels of the Rectangular
Mapping for the 96 x 128 Split Grid Model
          192 x 256                          96 x 128
points per        number of          points per        number of
latitude circle   such circles       latitude circle   such circles
      8                4                  16                8
     16                4                  32                8
     32                8                  64               16
     64               16                 128               32
    128               32                  64               16
    256               64                  32                8
    128               32                  16                8
     64               16
     32                8
     16                4
      8                4

27328 points per variable            6912 points per variable
per level                            per level

Table 6.2.2-2 Distribution of the Various Sizes
of Latitude Circles for one Level
In each of the split grid sizes, fifty-six percent of the processors
are occupied by data. This seeming loss of efficiency is more than repaid by
the fact that the time step for the split grid model is at least twice that
for the corresponding rectangular model.
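The point totals and the fifty-six percent occupancy can be recomputed from the distribution in Table 6.2.2-2. The sketch below is a plain arithmetic check; the list and function names are hypothetical, and the storage stride is simply the partition size divided by the number of points on the circle.

```python
# (points per latitude circle, number of such circles), as read from Table 6.2.2-2
GRID_192x256 = [(8, 4), (16, 4), (32, 8), (64, 16), (128, 32), (256, 64),
                (128, 32), (64, 16), (32, 8), (16, 4), (8, 4)]
GRID_96x128 = [(16, 8), (32, 8), (64, 16), (128, 32), (64, 16), (32, 8), (16, 8)]


def occupancy(distribution, equator_points):
    """Return (points, circles, fraction of rectangular grid slots in use)."""
    circles = sum(count for _, count in distribution)
    points = sum(size * count for size, count in distribution)
    return points, circles, points / (circles * equator_points)


def storage_stride(points_per_circle, partition_size):
    """Distance between active processors in the rectangular mapping."""
    return partition_size // points_per_circle


print(occupancy(GRID_192x256, 256))
print(occupancy(GRID_96x128, 128))
print(storage_stride(8, 256), storage_stride(16, 128))
```

The strides for the smallest circles, thirty-two for the 192x256 grid and eight for the 96x128 grid, are also the largest east-west shift distances needed in the first approach.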
The second approach to the split grid model uses latitude circles of
size sixteen through 128 for the 96x128 model and eight through 256 for the
192x256 model as indicated by Table 6.2.2-2. All shifts of data to support
east-west communication in this approach are shifts of one position. For most
cases, north-south communication requires a shift between different latitude
circles by the size of the circles involved. For example, when the array of
processors is treated as a collection of circles of size eight, an eight posi-
tion shift which treats the array as one circle of size 256 will facilitate
north-south communication. The exception noted above occurs when communication
between circles of different sizes must occur, as it must at split grid region
boundaries. For these cases, an omega network expansion or contraction of
interprocessor distance will suffice. How much of the potential gain which
this approach stands to provide over that of the rectangular approach can
actually be realized cannot be predicted at this time. Clearly, this second
approach to the split grid model would be more difficult to program.
6.2.3 The Polar Circle Sum
In all forms of the model, the poles are represented by a full lati-
tude circle of points whose values are computed and then averaged. In hard-
ware terms, values from each processor in a partition must be averaged. The
standard technique for this is the so-called log sum technique. Progressive
shift and add steps produce the sum of 2^N values in 2^N contiguous processors
in N steps. In the first step, all values are circularly shifted one place,
and the routed value is added to the stationary one. The sum is then routed
two places and added to the previous partial sum. Successive routing distances
double, until, in the final step, a shift of 2^(N-1) places occurs. In the
rectangular and compressed split grid models, the first shift is by one place; in
the rectangular split grid model, the first shift is by thirty-two places for
the 192x256 grid and by eight places for the 96x128 grid, since the initial
values are separated by these amounts.
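A minimal software sketch of the log sum follows. It models the processors as a Python list and the routing network as a circular shift; the function name and the `first_shift` parameter (used for the sparsely stored polar circles of the rectangular split grid mapping) are hypothetical.

```python
def log_sum(values, first_shift=1):
    """Sum the values held by a power-of-two circle of processors.

    Returns (per-processor sums, number of routing steps). Routing
    distances double each step, starting from `first_shift`.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("the number of processors must be a power of two")
    sums = list(values)
    shift, steps = first_shift, 0
    while shift < n:
        # Each processor receives the partial sum held `shift` places away
        # and adds it to its own stationary partial sum.
        routed = [sums[(p - shift) % n] for p in range(n)]
        sums = [own + received for own, received in zip(sums, routed)]
        shift *= 2
        steps += 1
    return sums, steps


sums, steps = log_sum(list(range(256)))
print(steps, sums[0], sums[255])
```

With 256 contiguous values, every processor holds the total after eight steps. With eight polar values stored thirty-two memories apart and zeros elsewhere, `log_sum(values, first_shift=32)` needs only three steps, and the active processors hold the total.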
6.2.4 A Hardware and Time Comparison of the Clos, Omega and Nearest Neighbor
Routing Schemes
The routing network described in section 4.3 requires an assembly-disassembly
register in each processor and either two or three crossbar switches
for each sixteen processors. Each assembly-disassembly register requires
twenty components, and each crossbar for an eight bit path uses 324 components.
The Clos network scheme uses four cables per processor. One of the cables
goes from the processor to the routing network, one goes from the routing net-
work back to the processor, and the remaining two cables connect the stages in
the three stage Clos network. An omega network uses only three cables per
processor.
The nearest neighbor scheme of the SOLOMON and ILLIAC IV requires
four cables per processor, assuming - as is true to date - that bi-directional
ECL differential cables are not feasible. In any case, four sets of line
drivers are required in each processor. To provide the vital broadcast input,
a fifth cable and five sets of line receivers are required in each processor.
The broadcast operation which permits the control unit to access a value from
any of the processors must be included with added hardware if this function
is desired. Moreover, some additional hardware is needed to support the input
and output needs of the array of processors.
Ignoring anything but the nearest neighbor and broadcast connection,
a fully parallel system would use seven six bit registers, four sets of ten
quadruple line drivers, five sets of ten quadruple line receivers, and forty
eight-to-one data selectors per processor. A byte serial scheme is much more
economical. Each processor would have to have an assembly-disassembly register,
four sets of line drivers, five sets of line receivers, and a byte's width
number of eight-to-one data selectors. Table 6.2.4-1 summarizes the component
counts and transmission times for the various options.
The nearest neighbor routing network permits only one and sixteen
position uniform shifts in a 256 processor circle. Partitions of that circle
                          Components          Transmission
Routing Scheme            for each Sixteen    Time in
                          Processors          Nanoseconds

Eight Bit
Clos Network                  1292                574

Eight Bit
Omega Network                  970                515

Parallel Nearest
Neighbor Network              2192                 91

Eight Bit Nearest
Neighbor Network               736                455

Table 6.2.4-1 Component Counts and Times for the Three Possible Routing
Schemes
are not supported. Expansion and contraction for connecting split grid regions
stored compactly are not supported. The omega network supports all of the
partitions and shifts required by the general circulation models discussed in
this paper. Shifts of any distance and direction within the permitted parti-
tions are all accomplished simultaneously in one pass through the routing net-
work. Only shifts of one and sixteen positions take one pass with the nearest
neighbor scheme.
It is clear from the above comments that the nearest neighbor routing
scheme finishes a distant third in the three way race for inclusion as the
routing scheme. Whether the Clos or omega network should be used depends on
the control algorithms available when an implementation is undertaken, and the
routing requirements on the machine which is being built. The Clos scheme uses
thirty-five more components per processor than the nearest neighbor scheme,
and the omega network uses only about fifteen more components per processor than the
nearest neighbor scheme.
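The component counts in this section can be recomputed from the per-part figures quoted in the text. The sketch below is an arithmetic check only. It assumes that the Clos network uses three crossbars per sixteen processors and the omega network two, takes the crossbar count as 324 components, and assumes that a byte-wide set of drivers or receivers needs two quadruple packages; the names are hypothetical.

```python
GROUP = 16        # processors per group, as in Table 6.2.4-1
REGISTER = 20     # components per assembly-disassembly register
CROSSBAR = 324    # components per eight bit crossbar switch (assumed reading)


def switched_network(crossbars_per_group):
    """Components per sixteen processors: Clos (3 crossbars) or omega (2)."""
    return GROUP * REGISTER + crossbars_per_group * CROSSBAR


def parallel_nearest_neighbor():
    # seven registers, 4 x 10 line drivers, 5 x 10 line receivers, 40 selectors
    return GROUP * (7 + 4 * 10 + 5 * 10 + 40)


def byte_serial_nearest_neighbor():
    # Assumption: two quadruple packages per eight bit set of drivers/receivers,
    # plus eight data selectors and one assembly-disassembly register.
    return GROUP * (REGISTER + 4 * 2 + 5 * 2 + 8)


clos = switched_network(3)
omega = switched_network(2)
serial = byte_serial_nearest_neighbor()
print(clos, omega, parallel_nearest_neighbor(), serial)
print((clos - serial) / GROUP, (omega - serial) / GROUP)
```

Under these assumptions the Clos, parallel nearest neighbor and byte serial nearest neighbor counts reproduce the table entries, and the omega count comes out two components below the tabulated 970.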
6.3 Image Data Processing
Results from the research conducted by a group led by Robert Ray
(1974) have shown that the ILLIAC IV is an efficient computer for processing
multispectral image data from the Earth Resources Technology Satellite (ERTS)
experiment (George, 1971). The initial stages of Ray's work have produced
ILLIAC IV implementations of the data clustering algorithms (Thomas, 1974). These
algorithms were adapted by the Laboratory for Application of Remote Sensing (LARS)
of Purdue University (Wacker, 1970) from the ISODATA algorithm of Ball and
Hall (Ball, 1965). These algorithms, originally developed for use with aircraft
multispectral scanner image data, have been successfully applied to similar
data collected by the ERTS satellites.
The ERTS satellite measures solar energy reflected from the earth's
surface; four different spectral bands of reflected energy are measured for
each point. The data is processed in terms of frames which contain 7.7(10)^6
(3240 times 2340) points each. Since each point is represented by values of
reflected energy in four spectral bands, each frame of ERTS data contains almost
thirty-one million small integer values.
The LARS technique has two steps. The first step uses manually
selected areas to compute "spectral signatures" for known terrain features.
The statistical characterizations so determined are then applied to large
areas of interest to estimate the extent and amount of terrain with features
like those in the training areas. These two steps, called clustering and
classification respectively, are described in the following two sections as
potential applications of the machine design presented in this paper.
6.3.1 Image Data Clustering
The ERTS data for a given point (an area of approximately 1.1 acres)
consists of a vector of four spectral energy measurements. The objective of
the clustering algorithm is to partition the data in the test region into M or
fewer spectrally dissimilar classes. Iteration of the steps in the algorithm
continues until the M clusters of the initial data are determined. Each
cluster is characterized by a mean of its four dimensional spectral data points
and a four by four symmetric covariance matrix.
The algorithm is described in detail in the following text together
with comments on how the machine design of this paper would be used to implement
the algorithm.
The entire set of 256 processors is used in concert during the
clustering algorithm. The initialization steps in the algorithm determine
initial mean and standard deviation vectors for the set of data points.
A given data point is represented by a four element vector,

$$X_i = (X_{i,1}, X_{i,2}, X_{i,3}, X_{i,4}).$$

The initial four means,

$$m_j = \frac{1}{N} \sum_{i=1}^{N} X_{i,j}, \quad j = 1, 2, 3, 4,$$

are found for the complete set of N data values. The algorithm should distribute
the data points uniformly across all 256 processors of the array. The
summation process begins with a loop which adds all values within each processor
and ends with a log sum step (see section 6.2.3) across all 256 processors.
The initial value N is broadcast. The four means, recovered by the control
unit through its port to the routing unit, are broadcast to permit computation
of four initial standard deviation values:

$$s_j = \left[ \frac{1}{N-1} \sum_{i=1}^{N} (X_{i,j} - m_j)^2 \right]^{1/2}.$$

The cartesian product of the four real line intervals,

$$I_j = [m_j - s_j,\; m_j + s_j], \quad j = 1, 2, 3, 4,$$

defines a rectangular parallelepiped which should contain most of the sample
points. The M initial cluster centers are chosen to be uniformly spaced along
a diagonal of this parallelepiped, and all M values are computed and stored by
each processor. The algorithm iterates the following two steps to determine
M final cluster centers.
Step one determines the Euclidean distance between each point and
each of the M cluster centers. Each point is assigned to the cluster with
the nearest cluster center. This calculation takes place without any inter-
processor communications.
Step two computes new cluster centers by using the means of the vec-
tors in each cluster. If no vector changed clusters in step one, the algorithm
terminates. A change of cluster is determined by using the processor mode
sensing hardware described in section 4.4.1.
The result of the clustering process is M four element cluster centers
and M symmetric four by four variance-covariance matrices. The elements
of these matrices,

$$c_{i,j} = \frac{1}{P-1} \sum_{k=1}^{P} (X_{k,i} - m_i)(X_{k,j} - m_j), \quad i, j = 1, 2, 3, 4,$$

and the number of vectors, P, within each cluster are computed by intra-processor
summation followed by log sum steps for the entire processor array.
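The clustering iteration described in this section can be sketched sequentially as follows. This is an illustration under stated assumptions, not the ILLIAC IV or LARS implementation: it ignores the distribution of points across processors, it places the initial centers at the midpoints of M equal segments of the diagonal (the exact spacing is an assumption), and all function names are hypothetical.

```python
import math


def initial_centers(points, m):
    """M centers spaced uniformly along the diagonal from mean - s to mean + s."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    std = [math.sqrt(sum((p[j] - mean[j]) ** 2 for p in points) / (n - 1))
           for j in range(dim)]
    return [[mean[j] - std[j] + 2.0 * std[j] * (k + 0.5) / m for j in range(dim)]
            for k in range(m)]


def nearest(point, centers):
    """Index of the center with the smallest Euclidean distance to the point."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(point, c))
    return min(range(len(centers)), key=lambda k: dist2(centers[k]))


def cluster(points, m, max_iter=100):
    centers = initial_centers(points, m)
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]   # step one
        if new_labels == labels:                             # no vector moved
            break
        labels = new_labels
        for k in range(m):                                   # step two
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(p[j] for p in members) / len(members)
                              for j in range(len(members[0]))]
    return centers, labels
```

On the array machine, step one would need no interprocessor communication, while the sums in step two would be formed within each processor and then combined by log sum steps.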
6.3.2 Image Data Classification
The clustering algorithm determines a cluster mean and covariance
matrix for each of M clusters which it identifies in the data for a selected
set of ERTS data. The classification algorithm uses these two parameters for
each of the M classes and, for each point of the data being classified, computes
the probability of class membership for each of the M classes, and assigns
each point to the class for which its probability of membership is highest.
The probability function, based on the assumption that the distribution func-
tion is multivariate normal, is
$$P_i(X) = b_i - \frac{1}{2} \left[ (X - M_i)^T C_i^{-1} (X - M_i) \right], \quad i = 1, 2, \ldots, M.$$

The terms in the probability function are:
X: a four component vector of ERTS data,
M_i: the four component mean vector for class i,
C_i: the four by four covariance matrix for class i, and
b_i: $-\frac{1}{2} \log |C_i|$ (Fu, 1968).
The constants b_i and the covariance matrix inverses are computed by a step
intermediate to the clustering and classification steps. These constants may be
used in several classification steps.
In the following two sections, we discuss two different ways to
organize the execution of the classification process.
6.3.2.1 Classification by Routing Point Values
In this scheme, we partition the array of processors into circles
of size M, the number of data clusters or classes. One processor in each partition
is loaded with the constants for one data class. Considerable flexibility
is provided by this approach. For example, several different sets of
data classes can be applied to one set of ERTS data by using different input
constants in different partitions. The input ERTS data can be distributed
across the partitions as desired. If only one set of classification constants
is used, the input ERTS data can be uniformly distributed across the array of
processor memories. Within each partition, M points at a time (plus a class
number and probability value) are routed circularly around the M processors
in the circle one step at a time. The probability that a point lies in a class
is computed by the processor which stores the constants for that class; the
class number and probability of the most likely class and the four spectral values
are forwarded around the circle. When the M steps for each M points have been
completed, each of those points has been assigned to its proper class.
This scheme makes full use of the Clos routing network; circular
shifts of one position at a time are all that the scheme requires, and arbitrary
class sizes are facilitated. Unless M, the number of classes, is a power of
two, there will be inactive processors. If M is a power of two, the omega net-
work will support the algorithm.
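The circulation of points around one partition can be sketched as follows. The sketch models a single circle of M processors as Python lists; the scoring function used in the usage example is a stand-in (negative squared distance to a class mean) rather than the probability function of section 6.3.2, and all names are hypothetical.

```python
def ring_classify(points, score_fns):
    """Classify M points held by a circle of M processors.

    points[p] is the point initially held by processor p; score_fns[p] is
    the scoring function for the class whose constants processor p stores.
    Returns, for each processor, (point, best class, best score).
    """
    m = len(score_fns)
    if len(points) != m:
        raise ValueError("one point per processor is expected")
    # Each travelling packet carries the point and the best class found so far.
    packets = [(pt, None, float("-inf")) for pt in points]
    for _ in range(m):
        scored = []
        for p, (pt, best_class, best_score) in enumerate(packets):
            s = score_fns[p](pt)          # processor p scores against its class
            if s > best_score:
                best_class, best_score = p, s
            scored.append((pt, best_class, best_score))
        # Circular shift of one position: processor p receives from p - 1.
        packets = [scored[(p - 1) % m] for p in range(m)]
    return packets


def make_score(mean):
    """Stand-in score: negative squared distance to the class mean."""
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, mean))


means = [(0, 0, 0, 0), (10, 10, 10, 10), (20, 20, 20, 20)]
result = ring_classify([(19, 19, 19, 19), (1, 1, 1, 1), (9, 9, 9, 9)],
                       [make_score(m) for m in means])
print([cls for _, cls, _ in result])
```

After M steps each packet has visited every processor in the circle and has returned to the processor from which it started, carrying the number of its most likely class.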
6.3.2.2 Classification by Broadcasting the Class Constants
In this scheme, the ERTS data is uniformly distributed across the
256 processors and their memories. The sets of constants which describe the
classes of interest are broadcast by the control unit for storage in the pro-
gram memory. Classification with respect to several sets of classification
parameters can be performed by broadcasting the several sets of classification
constants. In this scheme, there need be no inactive processors. Each cycle
in the classification process requires fifteen uses of the routing network to
broadcast the ten values for the symmetric covariance matrix, the four class
mean values, and the constant "b" term for each class. The previous scheme uses
the routing network six times in each step. The degree of independent (that
is concurrent) action permitted by the control unit for the processor array and
the routing network will determine which of the two schemes is to be preferred.
6.3.3 Byte Packing and Unpacking
The ERTS data, measured by photosensors and converted to digital data
by the satellite, consists of many small integer values: each spectral measurement
is converted to a six bit value. Moreover, the classification process
assigns each point to a class which can be represented by a small integer. Thus,
for efficient use of the input and output facilities of the machine, it is important
to be able to unpack several small integer values from one word of data,
and to be able to pack several small computed values into one data word.
Figure 6.3.3-1 illustrates how four ERTS values for one point can be
packed into one word for input and unpacked for use by the machine. Part (a)
of the figure shows the four bytes packed into the twenty-four bit fraction of
a data word. Part (b) shows the result of an AND operation with a mask which
selects value three and assigns it the exponent value plus four.
Because the exponent radix of the machine is sixteen, the binary
point can only lie between four bit digit positions; for value three, this
means that the binary point is placed within the value, not at its right end
where it belongs. A multiplication by 2^2 (that is, a shift operation) results
in a non-normalized integer value with the correct exponent value and with the
binary point in the correct position as shown in Figure 6.3.3-1(c).
Figure 6.3.3-2 illustrates how a small integer value is packed into
the desired position of a data word fraction. The initial integer value, a
full word as shown in part (a) of the figure, is added to the constant shown
in part (b) with a floating point non-normalized addition. The result of the
addition is shown in part (c) of the figure. The arrows in part (b) and (c)
of the figure indicate the position of the binary point. The value is aligned
by a "shift" of two places (division by 2^2), which yields the result shown in
part (d). The next step ANDs the part (d) result with a mask. A final step
to OR this result, shown in part (e), into a data word with other packed values
is not shown in the figure.
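The effect of the packing and unpacking operations can be illustrated with ordinary integer shifts and masks. Note that this sketch does not reproduce the floating point mechanism described above (non-normalized additions and multiplications by powers of two on a radix sixteen exponent); it only shows the resulting layout of four six bit values in a twenty-four bit fraction. The names are hypothetical.

```python
VALUE_BITS = 6          # each spectral measurement is a six bit value
VALUES_PER_WORD = 4     # four values fill the twenty-four bit fraction
MASK = (1 << VALUE_BITS) - 1


def pack(values):
    """Pack small integers into one fraction, first value leftmost."""
    word = 0
    for v in values:
        if not 0 <= v <= MASK:
            raise ValueError("value does not fit in six bits")
        word = (word << VALUE_BITS) | v     # shift, then OR the value in
    return word


def unpack(word, count=VALUES_PER_WORD):
    """Recover the packed values by shifting and ANDing with a mask."""
    return [(word >> (VALUE_BITS * (count - 1 - k))) & MASK
            for k in range(count)]


packed = pack([1, 2, 3, 4])
print(hex(packed), unpack(packed))
```

The AND with a mask selects one value and the shift moves it to the right end of the word, which corresponds to parts (b) and (c) of Figure 6.3.3-1.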
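[Figure not reproduced; parts (a) through (c) show the packed word, the masked value three, and the shifted result]
Figure 6.3.3-1 Unpacking Data Values
[Figure not reproduced; parts (a) through (e) show the steps of the packing operation]
Figure 6.3.3-2 Packing Data Values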
6.4 File Processing and Information Retrieval
In this section, several examples of file processing and information
retrieval will illustrate the capabilities of the machine for this class of
problems. The first example concerns file comparisons to determine statistics
about pairs of similar files including how a large file can be efficiently
sorted. A second example shows how information can be retrieved from a file
with the machine.
6.4.1 File Statistics
Post processing of weather model data frequently includes comparison
of two files of data taken from two model runs with slightly different starting
conditions. Average differences between various parameters are sought. Two
such files can be read into the memory of the machine and compared 256 points
at a time. If the average difference between two temperatures is sought, for
example, 256 sums of pointwise differences within the 256 processors can be
quickly computed. A final sum of the 256 partial sums can be computed by an
eight step "log sum" which adds values routed by one, two, four, eight, . . .,
128 positions. Eight such steps, the log to the base two of 256, produce the
sum of all the pointwise differences which was sought. Each one of the 256
processors contains a copy of the same value at the end of the process.
If a distribution for the differences is sought, each processor can
compute and sort all differences for the points which it holds. Then a 256 way
merge of the 256 sorted lists of differences can be performed by an eight step
comparison process which determines the smallest of the 256 locally smallest
values, for example. At the end of the process, all 256 processors contain the
same smallest value. The number of occurrences of the value can be determined
by a log sum of the number of occurrences of the value in each of the processors;
the log sum result will also be held in each of the 256 processors at
the end of the log sum process. Hence, a sorted list of pointwise differences
together with a count of their individual frequencies can be easily extracted
by the control unit using its connection to one port of the routing network.
If an approximate distribution is sought, the interval of interest can be
divided into sub-intervals. For each broadcast sub-interval, each processor counts
the values it holds which lie in that sub-interval, and a log sum of these counts is performed.
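The merge described above can be sketched with a single combining routine used once with "minimum" and once with "sum". The sketch models the processors as a list of locally sorted lists; the function names are hypothetical, and the number of processors must be a power of two.

```python
def log_combine(values, op):
    """Combine one value per processor in log2(n) routing steps.

    At the end every processor holds op applied over all n values.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("the number of processors must be a power of two")
    result = list(values)
    shift = 1
    while shift < n:
        result = [op(result[p], result[(p - shift) % n]) for p in range(n)]
        shift *= 2
    return result


def distribution(local_lists):
    """Return sorted (value, frequency) pairs for values spread over processors."""
    queues = [sorted(lst) for lst in local_lists]   # each processor sorts locally
    infinity = float("inf")
    out = []
    while any(queues):
        heads = [q[0] if q else infinity for q in queues]
        smallest = log_combine(heads, min)[0]       # globally smallest value
        counts = []
        for q in queues:                            # local occurrence counts
            c = 0
            while q and q[0] == smallest:
                q.pop(0)
                c += 1
            counts.append(c)
        total = log_combine(counts, lambda a, b: a + b)[0]
        out.append((smallest, total))
    return out


print(distribution([[3, 1], [1, 2], [], [2, 2]]))
```

Each pass of the loop corresponds to one comparison process followed by one log sum, after which the control unit can read the pair from any processor.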
6.4.2 Information Retrieval
In this example, we suppose that the files of a computer dating
service are stored in the array memory. Since this example is included to
illustrate machine functions, no indices for the file are assumed. The raw
data records of the file are used. Let us suppose that a young customer
wishes to locate all girls who meet the following characteristics:
EYES: (green or blue) and HAIR: (blonde or red) and RELIGION:
(agnostic) and AGE: (22 through 27 years) and EDUCATION: (col-
lege graduate) and HEIGHT: (63 through 68 inches) and WEIGHT:
(two pounds or less per inch of height).
The mode logic can be used to evaluate 256 records of the file at a time. One
status register bit can accumulate the Boolean result while another is used to
compute each parenthesized term. After all the tests have been made for each
set of records, the 256 MODEOUT values can be ORed together and sampled by the
control unit as shown in Figure 4.4.1-1(b). If the sixteen bit result is zero,
no match was found. A one bit in any position indicates that one or more of
the processors in a sixteen processor group contain matches. With proper bit
handling instructions and MODEIN transmissions like those of Figure 4.4.1-2(b),
the control unit can process a sequence of MODEOUT signals of the type shown
in Figure 4.4.1-1(c) and locate each match in the array. A control unit
specified route can shift the identifying number for the match to the control
unit's routing unit port.
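The evaluation of the query and the grouping of the MODEOUT values can be sketched as follows. The record field names, the function names, and the choice of the most significant bit for the first group of sixteen processors are assumptions made for illustration; the mode logic itself is described in section 4.4.1.

```python
def matches(rec):
    """Evaluate the query on one record, term by term.

    `accumulated` plays the role of the status bit which accumulates the
    Boolean result; each tuple entry is one parenthesized term.
    """
    accumulated = True
    for term in (
        rec["eyes"] in ("green", "blue"),
        rec["hair"] in ("blonde", "red"),
        rec["religion"] == "agnostic",
        22 <= rec["age"] <= 27,
        rec["education"] == "college graduate",
        63 <= rec["height"] <= 68,
        rec["weight"] <= 2 * rec["height"],
    ):
        accumulated = accumulated and term
    return accumulated


def modeout_word(mode_bits, group=16):
    """OR the mode bits within each group of processors into one result bit."""
    word = 0
    for start in range(0, len(mode_bits), group):
        word = (word << 1) | int(any(mode_bits[start:start + group]))
    return word


good = {"eyes": "green", "hair": "red", "religion": "agnostic", "age": 24,
        "education": "college graduate", "height": 65, "weight": 120}
records = [dict(good, age=30)] * 256
records[37] = good
print(bin(modeout_word([matches(r) for r in records])))
```

A zero word means that no record in the current set of 256 matched; a one bit identifies a group of sixteen processors which the control unit must examine further.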
6.5 Matrix Inversion by Gaussian Elimination
In this section, we will discuss using the machine to solve systems
of equations or invert matrices using the familiar Gaussian elimination tech-
nique. The process can be used to solve several systems or invert several
matrices simultaneously. Two different situations are described: in the
first, a collection of inhomogeneous linear systems is to be solved; in the
second, the inverses of a given set of matrices are to be found. The algorithms
are similar and store the original matrix in skewed form, as suggested by
Kuck (1968) and illustrated in Figure 6.5-1. In the figure, the matrix and right
hand vector of the linear system Ax = b are shown. The A matrix is stored
skewed, but the "b" vector is stored all within the memory of one processor.
When skewed storage is used, parallel access to all of the elements of any
row or any column of the matrix can be achieved. In the figure, the rows are
stored across the processors with all elements of a given row having the same
word address in the various processor memories. The elements of a column, on
the other hand, all occupy different word addresses, so that processor index-
ing is required to fetch a column.
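The skewed storage map of Figure 6.5-1 can be expressed as a pair of index functions. In the sketch below, rows and columns are numbered from one as in the figure, while processors and word addresses are numbered from zero; the function names are hypothetical.

```python
def processor_of(row, col, n=7):
    """Processor (0-based) holding element A[row, col]; row r is rotated r - 1 places."""
    return (col - 1 + row - 1) % n


def word_address_of(row, col):
    """Word address (0-based) of element A[row, col]; one address per row."""
    return row - 1


n = 7
for col in range(1, n + 1):
    holders = sorted(processor_of(row, col, n) for row in range(1, n + 1))
    print(col, holders)
```

Every row and every column touches each of the seven processors exactly once, so either can be fetched in parallel; a column fetch differs only in that each processor must supply a different word address.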
6.5.1 Solution of Inhomogeneous Systems
Up to thirty-two seven-by-seven inhomogeneous systems can be solved
simultaneously if their coefficients are stored as shown in Figure 6.5-1. The
Gaussian elimination procedure has two phases. In the first phase, the matrix
                 PROCESSORS
a1,1  a1,2  a1,3  a1,4  a1,5  a1,6  a1,7  b1
a2,7  a2,1  a2,2  a2,3  a2,4  a2,5  a2,6  b2
a3,6  a3,7  a3,1  a3,2  a3,3  a3,4  a3,5  b3
a4,5  a4,6  a4,7  a4,1  a4,2  a4,3  a4,4  b4
a5,4  a5,5  a5,6  a5,7  a5,1  a5,2  a5,3  b5
a6,3  a6,4  a6,5  a6,6  a6,7  a6,1  a6,2  b6
a7,2  a7,3  a7,4  a7,5  a7,6  a7,7  a7,1  b7

Figure 6.5-1 Storage Map for a Seven by Seven Inhomogeneous
System
of the original system is reduced to upper triangular form with ones on the
main diagonal. In the second phase, the solution is found by back-substitu-
tion, reducing the matrix to the identity matrix and the right hand side to
the solution. The technique processes the columns one by one, beginning with
column one and proceeding through the columns in turn to the rightmost (or
highest numbered) column. The matrix under consideration is gradually reduced
one column (and one row) at a time until an upper triangular system remains.
The steps in the algorithm, described in detail in the following
sections, are:
1a) Find the element with the largest absolute value in the
lowest numbered column which remains under consideration,
and call it column i.
1b) Find the smallest row number of the several rows which
may contain elements with the value identified in step (1a).
1c) Exchange the row identified in step (1b) with row i.
Both rows must be shifted so that they are properly skewed
in their new positions.
1d) Divide all the elements of the new row i by element A_{i,i}.
Divide the new b_i by A_{i,i} also.
1e) For each of the rows k from i+1 through seven, multiply row i
by element A_{k,i} and subtract from row k.
At the completion of steps (la) through (le), the matrix will be in the upper
triangular form. The back substitution steps proceed from the last row's right
hand side element, b_7, back through that of the first row. They operate on the
columns of the upper triangular matrix from the highest numbered back through
to the first. The steps are:
2a) Distribute b. for use with all rows from 1 to 1-1.
J
d
2b) Multiply row j by element A±
t j for each row i from 1 to
j-1, and subtract the resulting multiple of row j from row i.
The result of the back substitution steps is to reduce A to the identity matrix
and the column of b's to the sought solution vector.
The seven steps outlined above are described in detail in the fol-
lowing seven sections.
6.5.1.1 Find the Pivot Element in the Leftmost Remaining Column
The matrix was stored in skewed form as shown in Figure 6.5-1 so
that all elements of any desired column would be available in parallel. The
element in the leftmost remaining column with the largest absolute value is
found by a process which resembles the log sum process described in section
6.2.3. In that section, however, the number of cooperating processors was al-
ways a power of two in number, while here, the number of processors varies
from step to step all the way from two up to the size of the system being
solved. In section 6.2.3, the processors which were cooperating were con-
tiguous; here, because the matrix is stored in skewed form, the elements which
must be considered together may not be stored in contiguous processors. We
will ignore the noncontiguity and describe the algorithm as though the pro-
cessors were contiguous. The Clos routing network, which can perform every
permutation, can be used to facilitate the desired connections.
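
The skewed layout can be sketched in a few lines of Python. The mapping below is inferred from the rows visible in Figure 6.5-1 (row i is rotated right by i-1 processors); the function names are illustrative and are not part of the original design.

```python
def skewed_processor(i, j, n=7):
    """Processor (1-based) assumed to hold A[i][j] when row i is rotated right by i-1."""
    return (j - 1 + i - 1) % n + 1


def column_processors(j, n=7):
    """Processors holding the elements of column j, listed by row 1..n."""
    return [skewed_processor(i, j, n) for i in range(1, n + 1)]


if __name__ == "__main__":
    # Each column lands on n distinct processors, so it can be read in parallel.
    for j in range(1, 8):
        assert sorted(column_processors(j)) == list(range(1, 8))
    print(column_processors(1))
```

Because every column occupies one element per processor, a column can be examined in a single parallel step, at the price of the noncontiguous routing discussed above.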
For a collection of processors which are a power of two in number,
the steps are the same as in a log sum, except that each processor selects the
larger of the two elements it considers at each step rather than producing
their sum. The number of comparison steps is the logarithm of the number of
processors to the base two. When the total number of processors is not a
power of two, subsets of the total number which each contain a power of two
processors form partial results which are then combined pairwise until the
Figure 6.5.1.1-1 The Log Combination Process for a Collection of
Processors not a Power of Two in Number
final result is produced. There is one such subset for each one bit in the
binary representation of the number of processors. Figure 6.5.1.1-1 illustrates
the process for seven processors. Three comparison steps are required; in
general, the number of comparison steps is the logarithm to the base two of
the smallest power of two which is greater than or equal to the number of pro-
cessors.
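
The step count can be checked with a small serial simulation. This is a sketch only: the processors are modelled as list positions, contiguity is assumed as in the text, and the names `log_max` and `power_of_two_subsets` are hypothetical.

```python
def log_max(values):
    """Tree-style search for the element of largest absolute value.

    Returns (element, number of comparison steps). Each pass of the
    while loop stands for one parallel comparison step.
    """
    vals = list(values)
    n = len(vals)
    steps = 0
    stride = 1
    while stride < n:
        # Processor p combines with processor p + stride when that partner exists.
        for p in range(0, n - stride, 2 * stride):
            vals[p] = max(vals[p], vals[p + stride], key=abs)
        stride *= 2
        steps += 1
    return vals[0], steps


def power_of_two_subsets(n):
    """Subset sizes: one power of two per one bit in the binary form of n."""
    return [1 << k for k in reversed(range(n.bit_length())) if (n >> k) & 1]


if __name__ == "__main__":
    print(log_max([3, -9, 2, 7, -1, 4, 8]))
    print(power_of_two_subsets(7))
```

For seven processors the subsets have sizes four, two and one, and three comparison steps are used, in agreement with the figure.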
6.5.1.2 Find the Smallest Numbered Row which Contains the Pivot Element
Once the pivot element value is identified, each processor which
stores that element submits its row number for a minimum seeking comparison
process. Processors which do not store the pivot value - by far the majority -
submit a value which exceeds the number of rows in the matrix. A log minimum
process determines the row number of the row to be exchanged with the lowest
numbered currently considered row. At the completion of this step, every
active processor contains the number of the row which contains the pivot
element.
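
A minimal sketch of this minimum seeking step follows. It assumes 0-based row numbers, compares absolute values when deciding which processors hold the pivot value, and uses Python's built-in `min` in place of the log minimum process; `pivot_row` is a hypothetical name.

```python
def pivot_row(column, first_active_row):
    """Return the lowest active row number holding the pivot value.

    column[r] is the element of the pivot column stored for row r (0-based).
    """
    n = len(column)
    active = range(first_active_row, n)
    pivot_value = max((column[r] for r in active), key=abs)
    sentinel = n + 1  # exceeds every valid row number
    submitted = [
        r if (r in active and abs(column[r]) == abs(pivot_value)) else sentinel
        for r in range(n)
    ]
    return min(submitted)  # stands in for the log minimum process


if __name__ == "__main__":
    print(pivot_row([5, 2, -9, 9, 1], first_active_row=1))
```

Rows above the first active row submit the sentinel as well, so an element eliminated in an earlier step can never be chosen.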
6.5.1.3 Exchange of the Pivot Row with the First Active Row
The number of the pivot row is available to all active processors
as the result of the previous step. The first active row number is available
by broadcast from the control unit. The difference of the two values is the
amount that the pivot row must be shifted left and the first row shifted right
to retain the correct skewed storage relationships. This shifting process
goes on in parallel for each of the systems being solved by the 256 processor
array. The shifting algorithm proceeds as follows:
1. Each processor puts the shift distance - a binary integer of eight or fewer
bits - in its eight bit status register within the mode logic. The number of
bits to be considered is the same as the number of steps in the log comparison
process which identified the pivot element.
2. For each bit to be considered, the mode of the processor is set from the
proper status register bit. The pivot row elements are shifted left by the
distance which the selected bit represents (one, two, four, and so on); the
shifted values are stored under mode control so that the shift takes place only
in those processors - that is, only in those equation systems - for which a
shift by that distance is required.
3. The first row still under consideration is shifted right by a process simi-
lar to that described in step two above. The only difference is that right
shifts are used instead of left shifts.
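
The bit by bit shifting under mode control can be sketched as follows. The sketch assumes end-around shifts within the processors of one system, which the text does not state explicitly, and it loops over the systems serially where the array would act on all of them at once. All names are illustrative.

```python
def rotate_left(row, k):
    """End-around left shift of one row by k positions (an assumption)."""
    k %= len(row)
    return row[k:] + row[:k]


def masked_shift_left(rows, distances, bits):
    """Shift rows[s] left by distances[s], one binary digit at a time.

    For bit b every system is offered a shift by 2**b; the shifted values
    are kept only where that bit of the system's distance is one, which
    plays the role of the mode bit taken from the status register.
    """
    rows = [list(r) for r in rows]
    for b in range(bits):
        for s, d in enumerate(distances):
            if (d >> b) & 1:
                rows[s] = rotate_left(rows[s], 1 << b)
    return rows


if __name__ == "__main__":
    systems = [[1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 7]]
    print(masked_shift_left(systems, distances=[3, 0], bits=3))
```

The number of passes equals the number of bits considered, not the largest shift distance, which is why the cost matches that of the log comparison process.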
6.5.1.4 Divide the Pivot Row by the Pivot Element
The pivot element was distributed among all active processors by
the steps described in section 6.5.1.1. This value is divided into each
element of the pivot row. This step leaves the pivot element exactly one in
value.
6.5.1.5 Reduce the Leftmost Column to Lower Triangular Form
The pivot row is the lowest numbered remaining row, and it has been
normalized by the previous step so that the pivot element is one. For all
rows below the pivot row, we
1. distribute the element in the pivot column to all active
processors by a log distribution process, and
2. multiply a temporary copy of the pivot row by the distributed
element and subtract from the subject row.
The completion of the above two steps for all rows beyond the pivot row
reduces the lowest numbered remaining column to lower triangular form.
6.5.1.6 Back Substitution
At the completion of the previous steps, the matrix is in upper
triangular form with ones on the main diagonal. Back substitution reduces
this upper triangular form to the diagonal identity matrix. The last row of
the upper triangular form contains only a one in the last column and all the
rest zero elements. The back substitution process uses successive main diag-
onal ones from right to left as follows.
1. For each row above the row which contains the current main diagonal one,
distribute the element in the column which contains that main diagonal one by
a log distribution process.
2. Multiply a temporary copy of the row with the main diagonal one by the
distributed element and subtract from the row from which the distributed ele-
ment was taken. Include the right hand side vector in the multiplication and
subtraction process.
At the completion of the above two steps for all main diagonal elements from
right to left, the original matrix is reduced to the identity matrix and the
right hand side vector becomes the solution to the given set of equations.
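
The two phases can be followed in a serial Python sketch. It mirrors steps (1a) through (1e) and (2a) through (2b) but says nothing about the skewed storage or the routing; exact rational arithmetic is used only so that the result can be checked, and `solve` is a hypothetical name.

```python
from fractions import Fraction


def solve(a, b):
    """Solve A x = b by the two-phase elimination described above."""
    n = len(b)
    a = [[Fraction(x) for x in row] for row in a]
    b = [Fraction(x) for x in b]
    for i in range(n):
        # 1a, 1b: largest absolute value in column i, lowest row number on ties.
        big = max(abs(a[r][i]) for r in range(i, n))
        if big == 0:
            raise ValueError("matrix is singular")
        p = min(r for r in range(i, n) if abs(a[r][i]) == big)
        # 1c: exchange the pivot row with row i.
        a[i], a[p] = a[p], a[i]
        b[i], b[p] = b[p], b[i]
        # 1d: normalize row i so that the pivot element is one.
        pivot = a[i][i]
        a[i] = [x / pivot for x in a[i]]
        b[i] /= pivot
        # 1e: clear column i in the rows below.
        for r in range(i + 1, n):
            f = a[r][i]
            a[r] = [x - f * y for x, y in zip(a[r], a[i])]
            b[r] -= f * b[i]
    # 2a, 2b: back substitution from the last column to the second.
    for j in range(n - 1, 0, -1):
        for i in range(j):
            f = a[i][j]
            a[i] = [x - f * y for x, y in zip(a[i], a[j])]
            b[i] -= f * b[j]
    return b


if __name__ == "__main__":
    print(solve([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3]))
```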
6.5.1.7 Efficiency and Routing Requirements of the Gaussian Elimination Process
The Gaussian elimination process described in the preceding sec-
tions clearly requires routing operations beyond the capabilities of the omega
network. The Clos network is necessary to support this algorithm, but we do
not currently have algorithms to compute the necessary control patterns.
As we have seen, the technique described in this section begins with
all processors in productive use, proceeds until only a fraction of the pro-
cessors are contributing, and returns to the condition where all processors
are in productive use. On the average, approximately half of the processors
are productive. When a great many matrices are to be processed, they should
be handled 256 at a time by a conventional program with one matrix (or sys-
tem) stored in each of the 256 processors. No inter-processor communication is
required. A collection of 128 or more matrices (or systems) can be processed
in this way with a processor efficiency at least as good as for the parallel
technique described above.
6.5.2 Inversion of a Matrix
To invert an N by N matrix with the Gaussian elimination technique,
one begins with an N by 2N matrix which includes an identity matrix appended to
the right of the given matrix, extending each row to twice its original size.
In a parallel processor, the best approach is to store the given matrix in
skewed form and the appended identity in non-skewed form in the same set of
processors with the given matrix. The operations performed on the given matrix
under the Gaussian technique are also performed on the appended identity matrix
(except for the shifts to reskew the identity). At the completion of the pro-
cess, the given matrix has been transformed to an identity matrix and the ap-
pended identity matrix is transformed to the inverse of the given matrix.
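
As an illustration, the same elimination applied to the N by 2N matrix can be written serially as below. The sketch ignores the skewed and non-skewed storage of the two halves and uses exact fractions so the result can be verified; `invert` is a hypothetical name.

```python
from fractions import Fraction


def invert(m):
    """Invert a square matrix by eliminating on [M | I]."""
    n = len(m)
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(r == c)) for c in range(n)]
        for r, row in enumerate(m)
    ]
    for i in range(n):
        # Pivot search; max() keeps the lowest row number on ties.
        p = max(range(i, n), key=lambda r: abs(aug[r][i]))
        if aug[p][i] == 0:
            raise ValueError("matrix is singular")
        aug[i], aug[p] = aug[p], aug[i]
        pivot = aug[i][i]
        aug[i] = [x / pivot for x in aug[i]]
        for r in range(i + 1, n):
            f = aug[r][i]
            aug[r] = [x - f * y for x, y in zip(aug[r], aug[i])]
    # Back substitution turns the left half into the identity.
    for j in range(n - 1, 0, -1):
        for i in range(j):
            f = aug[i][j]
            aug[i] = [x - f * y for x, y in zip(aug[i], aug[j])]
    return [row[n:] for row in aug]


if __name__ == "__main__":
    print(invert([[4, 7], [2, 6]]))
```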
7. Operating Parameters of the System
This section summarizes the cost, reliability, and power consump-
tion of the system. The calculations are based on the component counts shown
in Figures 7-1 through 7-4 which give detailed component counts, prices and
power requirements for the processor, memory module, sixteen by sixteen cross-
bar and table look up hardware. Table 7-1 summarizes these figures and gives
total parts counts and costs for these units; total costs are calculated in-
cluding the spares indicated, and power and parts counts include only the
units needed to form a complete operating system. These costs were derived
from data taken from competitive bids, parts orders for parts for the multi-
plier prototype which was built and telephone calls to suppliers. Assuming
that assembly costs will be approximately equal to integrated circuit costs,
the total cost for a 256 processor system with eight million words of data
memory and 128,000 words of program memory is approximately $3,000,000 if a
Clos three stage routing network is built.
A system with an omega routing network would be approximately
$100,000 less expensive. The cost figures do not include the costs of air
conditioning equipment.
The operating life of an integrated circuit component depends on
the operating temperature. The prices quoted for parts in Figures 7-1 through
7-4 assume that the lower cost SN7400 series parts, whose operating tempera-
tures must lie between zero and seventy degrees Celsius, are used. Figure 7-5
is a graph of the expected component failure rates versus temperature. The
failure rate data were taken from a Signetics Corporation report supplied to
the author by a supplier (Signetics Corporation, 1974B), and refer to that
COMPONENT   NUMBER     WATTS PER   COST PER
            OF UNITS   UNIT        UNIT
10124           2       0.468      $  4.50
10125           2       0.540      $  4.50
AM25S10        16       0.467      $  2.60
AM9309         10       0.240      $  6.00
AM9334          1       0.240      $  5.20
NAT8551         1       0.360      $  1.00
74S02           2       0.050      $  0.54
74S04           3       0.050      $  0.47
74S11           3       0.050      $  0.52
74S20           2       0.050      $  0.50
74LS32          1       0.049      $  0.34
74S51           4       0.110      $  0.23
74H52          10       0.275      $  0.23
74H61           1       0.080      $  0.22
74S64           2       0.250      $  0.38
74S74           4       0.250      $  0.75
74S85           8       0.250      $  3.93
74S86           1       0.250      $  0.71
74S133          3       0.300      $  0.42
74148           2       0.190      $  1.50
74150           2       0.340      $  1.41
74S151          4       0.225      $  2.25
74S153         14       0.225      $  4.50
74S157         22       0.390      $  3.76
74S158          1       0.305      $  3.76
74S172         40       0.500      $  5.99
74S175          1       0.480      $  1.68
74S181          7       1.100      $  3.15
74S182          6       0.260      $  4.86
74S195         16       0.545      $  1.68
74S257         12       0.495      $  3.76
74S260         13       0.300      $  0.42
74S274         36       0.500      $ 12.50
74S283         12       0.500      $  2.76
74S299          1       0.500      $  1.50
74S381         21       0.800      $  3.15
SIG8204         4       0.850      $ 27.20
SIG8205         2       0.850      $ 33.40
SIG8228        33       0.512      $ 21.87
SIG8243        12       0.500      $  4.95
SIG8263         5       0.475      $  4.50
TOTAL NUMBER OF COMPONENTS: 342
TOTAL POWER DISSIPATION: 154.923 WATTS.
TOTAL COST: $ 2236.04
Figure 7-1 Component Statistics for the Processor
COMPONENT   NUMBER     WATTS PER   COST PER
            OF UNITS   UNIT        UNIT
AMS           304       0.400      $  6.12
74S04           1       0.270      $  0.47
74LS138         1       0.055      $  1.43
74154           2       0.280      $  1.35
74S157          2       0.390      $  1.43
74S280         12       0.525      $  0.40
SIG82S42       10       0.290      $  0.71
TOTAL NUMBER OF COMPONENTS: 332
TOTAL POWER DISSIPATION: 132.465 WATTS.
TOTAL COST: $ 1879.84
Figure 7-2 Component Statistics for One Processor Memory of
32,768 Words with Thirty-eight Bits Each
COMPONENT   NUMBER     WATTS PER   COST PER
            OF UNITS   UNIT        UNIT
10101          36       0.135      $  0.47
10115          12       0.135      $  0.47
10133          32       0.390      $  2.95
10145          16       0.754      $ 13.00
10158          16       0.200      $  1.55
10164         256       0.390      $  1.65
TOTAL NUMBER OF COMPONENTS: 388
TOTAL POWER DISSIPATION: 136.764 WATTS.
TOTAL COST: $ 781.56
Figure 7-3 Component Statistics for One Sixteen by Sixteen Crossbar
COMPONENT   NUMBER     WATTS PER   COST PER
            OF UNITS   UNIT        UNIT
10124           2       0.468      $  4.50
10125           2       0.540      $  4.50
74S157          4       0.390      $  3.76
74LS193         4       0.155      $  2.12
74S195         16       0.545      $  1.68
TOTAL NUMBER OF COMPONENTS: 28
TOTAL POWER DISSIPATION: 12.916 WATTS.
TOTAL COST: $ 68.40
Figure 7-4 Component Statistics for One Table Look Up Unit
Exclusive of the Memory
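
The totals printed under each component table are the sums of units, of units times watts per unit, and of units times cost per unit. A short check for the table look up unit is sketched below; the part data are copied from the table above and the function name `totals` is illustrative.

```python
# (component, number of units, watts per unit, cost per unit in dollars)
TABLE_LOOK_UP_PARTS = [
    ("10124", 2, 0.468, 4.50),
    ("10125", 2, 0.540, 4.50),
    ("74S157", 4, 0.390, 3.76),
    ("74LS193", 4, 0.155, 2.12),
    ("74S195", 16, 0.545, 1.68),
]


def totals(parts):
    """Return (component count, total watts, total cost) for a parts list."""
    count = sum(units for _, units, _, _ in parts)
    watts = sum(units * w for _, units, w, _ in parts)
    cost = sum(units * c for _, units, _, c in parts)
    return count, round(watts, 3), round(cost, 2)


if __name__ == "__main__":
    print(totals(TABLE_LOOK_UP_PARTS))
```

The same function applied to the other parts lists gives a quick consistency check on the transcribed counts and prices.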
[Table 7-1: summary of parts counts, power, and costs for the processor, memory module, crossbar, and table look up units; the tabulated values are not recoverable from the scan]
[Graph: vertical axis FAILURE RATE (%/1000 HOURS); curves labeled 85°C, 70°C, 50°C and 25°C]
Figure 7-5 Graph of Component Failure Rate Versus Temperature
company's SN7400 line. This report presented the most comprehensive review
of failure rate data which the author was able to obtain. The data in the
report pertain to the low power Schottky devices in the Signetics 7400 lines,
not to the regular (non-low power) devices used in this design. Table 7-2
gives the failure rate data for a 200,000 integrated circuit component system
using values taken from the graph in Figure 7-5. As the table indicates, we
should expect the system to operate for twenty-six to forty-five hours between
failures. Several spare processors, crossbars and memory modules will be
available to replace a unit which fails. No design for the control unit was
included since work came to an end before that was possible. However, because
of its critical role in the system, it could well be the best policy to build
two complete control units so that a spare one would be available in the event
of a control unit failure.
Temperature   Number of        Failures per       Mean Time
              Failures         1000 hours         Between
              per 1000 hours   for a 200,000      System Failures
                               component system   (hours)
85°C          0.00019          38                  26
70°C          0.00011          22                  45
50°C          0.000044          9                 111
Table 7-2 System Reliability
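
The entries of Table 7-2 follow from simple arithmetic: the per-component rate is multiplied by 200,000 components, and the mean time between failures is 1000 hours divided by that system rate. The sketch below assumes the system rate is rounded to a whole number before dividing, which reproduces the tabulated hours, and reads the 50°C rate as 0.000044, the value consistent with nine failures per 1000 hours. The function name is illustrative.

```python
def system_reliability(failures_per_1000h_per_component, components=200_000):
    """Return (system failures per 1000 hours, whole hours between failures)."""
    system_rate = round(failures_per_1000h_per_component * components)
    mtbf_hours = 1000 // system_rate
    return system_rate, mtbf_hours


if __name__ == "__main__":
    for temperature, rate in (("85°C", 0.00019), ("70°C", 0.00011), ("50°C", 0.000044)):
        print(temperature, system_reliability(rate))
```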
8. Conclusion
The author believes that the foregoing sections - mainly section
4, section 6.2, and section 7 - show that a computer with roughly 100 times
the computing capacity of the IBM 360/95 can be built for significantly less
than other computers with similar capability.
Another result of the work described here is the simulation meth-
odology described in section 5 and illustrated in the appendix.
Considerable work remains to be done on the routing system. Although
we believe that an omega network is sufficient to support the intercommuni-
cation needs of the general circulation model, the matrix manipulation
example of section 6.5 shows that the three stage Clos network would provide
support for a wider class of problems at a modest increase in cost. However,
we have no algorithm to produce control patterns for the Clos network.
References
Advanced Micro Devices Incorporated, 1974
Advanced Micro Devices Data Book, Advanced Micro Devices Incorporated,
1974.
Arakawa, 1972
Arakawa, A., Design of the UCLA General Circulation Model, Technical
Report Number 7 of the Department of Meteorology, University of California
at Los Angeles, July 1972.
Benes, 1965
Benes, V. E., Mathematical Theory of Connecting Networks and Telephone
Traffic, Academic Press, 1965.
Breuer, 1972
Breuer, Melvin A. (Editor), Design Automation of Digital Systems, Volume
One, Theory and Techniques, Prentice Hall, 1972, pp. 101-172.
Carroll, 1967
Carroll, Arthur B. and Wetherald, Richard T., "Application of Parallel
Processing to Numerical Weather Prediction," Journal of the Association
for Computing Machinery, Volume 14, number 3, July 1967, pp. 591-614.
Clos, 1953
Clos, Charles, "A Study of Non-blocking Switching Networks," Bell System
Technical Journal, Volume 32, number 2, March 1953, pp. 406-424.
Dietmeyer, 1975
Dietmeyer, Donald L., Chairman of the 1975 workshop on computer hardware
description languages and their applications, verbal communication, 1975.
Downing, 1974
Downing, Robert, Physics Department, University of Illinois at Urbana-
Champaign, verbal communication, 1974.
Fox, 1961
Fox, L., Numerical Solution of Ordinary and Partial Differential
Equations, Pergamon Press, 1962, p. 348.
Fu, 1968
Fu, K. S., Sequential Methods in Pattern Recognition and Machine Learning,
Academic Press, 1968.
Garcia, 1974
Garcia, G., Department of Computer Science, University of Illinois at
Urbana-Champaign, unpublished communication, 1974.
Gates, 1975
Gates, W. L., RAND Corporation, oral communication, 1975.
George, 1971
George, Theodore A., "ERTS A and B - The Engineering Systems," Astronautics
and Aeronautics, Volume 9, number 4, April 1971, pp. 41-51.
Halem, 1974
Halem, M., Goddard Institute for Space Studies, oral communication, 1974.
Hamming, 1950
Hamming, Richard W., "Error Detecting and Correcting Codes," Bell System
Technical Journal, Volume 29, April 1950, pp. 147-160.
Hnatek, 1973
Hnatek, Eugene R., A User's Handbook of Integrated Circuits, John Wiley
and Sons, 1973.
IBM, 1970
IBM System/360 Principles of Operation, file number S360-01, order number
GA22-6821, version 8, 1970, pp. 41-42.
Karn, 1974
Karn, Ronald, Goddard Institute for Space Studies, oral communication,
1974.
Karn, 1975
Karn, Ronald, GISS Model ILLIAC Implementation, Computer Sciences Corpora-
tion, 1975.
Kasahara, 1967
Kasahara, A. and Washington, W. M., "NCAR Global General Circulation
Model of the Atmosphere," Monthly Weather Review, Volume 95, number 7,
July 1967, pp. 389-402.
Knuth, 1968
Knuth, Donald E., The Art of Computer Programming, Volume 1: Fundamental
Algorithms, Addison-Wesley, 1968.
Kuck, 1968
Kuck, David J., "ILLIAC IV Software and Application Programming," IEEE
Transactions on Computers, Volume 17, August 1968, pp. 758-770.
Lawrie, 1973
Lawrie, D. H., "Memory-Processor Connection Network," PhD Thesis, Depart-
ment of Computer Science, University of Illinois at Urbana-Champaign,
report number UIUCDCS-R-73-557, February 1973.
Ledley, 1960
Ledley, Robert S., Digital Computer and Control Engineering, McGraw-Hill,
1960, pp. 519-525.
Lorentz, 1963
Lorentz, E. N., "The Predictability of Hydrodynamic Flow," Transactions
of the New York Academy of Science, 1963, serial 2, pp. 409-432.
Manabe, 1969
Manabe, S. and Bryan, K., "Climate Calculations with a Combined Ocean-
Atmospheric Model," Journal of the Atmospheric Sciences, Volume 26,
number 4, July 1969, pp. 786-789.
Mintz, 1974
Mintz, Y. and Arakawa, A., Notes distributed at the second workshop on
the UCLA general circulation model, March 25-April 4, 1974, Department
of Meteorology, University of California at Los Angeles.
National Semiconductor Corporation, 1974
Digital Integrated Circuits, National Semiconductor Corporation, 1974.
Ray, 1974
Ray, Robert M., Thomas, John and Donovan, Walter E., Implementation of
ILLIAC IV Algorithms for Multispectral Image Interpretation, Center for
Advanced Computation document number 112, Center for Advanced Computation,
University of Illinois at Urbana-Champaign, June 1974.
Semptner, 1974
Semptner, A. J., Department of Meteorology, University of California at
Los Angeles, oral communication, 1974.
Signetics Corporation, 1974A
Signetics Digital, Linear, and MOS Data Book, Signetics Corporation, 1974.
Signetics Corporation, 1974B
Signetics Bipolar Junction Isolated TTL Low Power Schottky Integrated
Circuit Failure Rates, Signetics Corporation, November 1974.
Slotnick, 1962
Slotnick, Daniel L., "The SOLOMON Computer," Proceedings of the 1962
Fall Joint Computer Conference, Spartan Books, 1962, pp. 97-107.
Slotnick, 1968
Slotnick, Daniel L., et al., "The ILLIAC IV Computer," IEEE Transactions
on Computers, Volume 17, August 1968, pp. 746-757.
Smagorinsky, 1963
Smagorinsky, J., "General Circulation Experiments with the Primitive
Equations: I. The Basic Experiment," Monthly Weather Review, Volume 91,
number 3, March 1963, pp. 99-164.
Somerville, 1974
Somerville, R. J. C., et al., "The GISS Model of the Global Atmosphere,"
Journal of the Atmospheric Sciences, Volume 31, number 1, January 1974,
pp. 84-117.
Stenzel, 1975
Stenzel, William, "A Class of Compact High Speed Parallel Multiplication
Schemes," Masters Thesis, Department of Computer Science, University of
Illinois at Urbana-Champaign, 1975.
Tessler, 1968
Tessler, Larry G. and Enea, H. J., "A Language Design for Concurrent Pro-
cesses," Proceedings of the 1968 Spring Joint Computer Conference,
Thompson Book Company, 1968, pp. 403-408.
Thomas, 1974A
Thomas, John, An ILLIAC IV Algorithm for Cluster Analysis of ERTS-1 Data,
Center for Advanced Computation Technical Memorandum Number 17, Center
for Advanced Computation, University of Illinois at Urbana-Champaign,
May 1974.
Thomas, 1974B
Thomas, John, An ILLIAC IV Algorithm for Statistical Classification of
ERTS-1 Data, Center for Advanced Computation Technical Memorandum Number
18, Center for Advanced Computation, University of Illinois at Urbana-
Champaign, May 1974.
Texas Instrument Corporation, 1973
The TTL Data Book for Design Engineers, first edition, document number
CC-411, Texas Instruments Incorporated, 1973.
Texas Instrument Corporation, 1974
Supplement to the TTL Data Book for Design Engineers, first edition,
document number CC-416, Texas Instruments Incorporated, 1974.
Tsang, 1973
Tsang, L. C. and Karn, R., A Documentation of the GISS Nine-Level Atmos-
pheric General Circulation Model, Computer Sciences Corporation, October
1973.
Wacker, 1970
Wacker, Arthur G. and Landgrebe, David A., "Boundaries in Multispectral
Imagery by Clustering," presented at the 1970 IEEE Symposium on Adaptive
Processes, December 1970.
Williamson, 1973
Williamson, D. L. and Washington, W. M., "On the Importance of Precision
for Short Range Forecasting and Climate Simulation," Journal of Applied
Meteorology, Volume 12, 1973, pp. 1254-1258.
Appendix
The material in this appendix is a sequence of computer printout
which gives the complete set of control cards, logic description and control
data (STEPs) which were used to test the floating point addition subset of the
array processor.
//COMPEL EXEC PGM=COMPEL,REGION=154K,PARM='R,ISA(74K)'
//SYSPRINT DD SYSOUT=A
//DECK DD DSN=&DECKFOG,UNIT=DISK,DCB=(BLKSIZE=3120,RECFM=FB),
// SPACE=(TRK,(5,1)),DISP=(NEW,PASS)
//DECK DD SYSOUT=A,DCB=(BLKSIZE=800,RECFM=FB)
//MICRO DD DSN=&MICFOG,UNIT=DISK,DCB=(BLKSIZE=3120,RECFM=FB),
// SPACE=(TRK,(5,1)),DISP=(NEW,PASS)
//MICRO DD SYSOUT=A,DCB=(BLKSIZE=800,RECFM=FB)
//PLIDUMP DD SYSOUT=A
$ THIS LOGIC TESTS THE "A" FRACTION FOR ZERO
ATEST(1) : S260 A(1,4) ;
ATEST(2) : S260 A(5,4) ;
ATEST(3) : S260 A(9,4) ;
ATEST(4) : S260 A(13,4) ;
ATEST(5) : S260 A(17,4) ;
ATEST(6) : S260 A(21,4) ;
ATEST(7) : S260 A(25,4) ;
ATEST(8) : S260 A(29,4) ;
AZERO : S133 ATEST(1,8) ;
10 : OUTPUT AZERO BZERO ;
20 : OUTPUT ATEST(1,8) ;
$ THIS LOGIC TESTS THE "B" FRACTION FOR ZERO
BTEST(1) : S260 B(1,4) ;
BTEST(2) : S260 B(5,4) ;
BTEST(3) : S260 B(9,4) ;
BTEST(4) : S260 B(13,4) ;
BTEST(5) : S260 B(17,4) ;
BTEST(6) : S260 B(21,4) ;
BTEST(7) : S260 B(25,4) ;
BTEST(8) : S260 B(29,4) ;
BZERO : S133 BTEST(1,8) ;
20 : OUTPUT BTEST(1,8) ;
$ THIS LOGIC CONTROLS THE ALIGNMENT SHIFTING
ASHSEL : S20 EXC2 AZERO BZERO SHZERO ;
BSHSEL : S20 EXC2BAR AZERO BZERO SHZERO ;
: OUTPUT SHZERO ;
ASHIFT(1,3) : S157 ABS(5,3) ZEROS(1,3) ASHSEL ZERO ;
BSHIFT(1,3) : S157 ABS(5,3) ZEROS(1,3) BSHSEL ZERO ;
GTR8 : S260 ABS(1,4) ;
ENASH : S51 GTR8 AINH ZERO ZERO ;
ENBSH : S51 GTR8 BINH ZERO ZERO ;
20 : OUTPUT GTR8 ENASH ENBSH ;
$ THIS LOGIC COMPARES THE "A" AND "B" FRACTIONS
ALESS1(1) UNUSED AGTR1(1) : S85 A(4,4) B(4,4) B(8) ZERO A(8) ;
ALESS1(2) UNUSED AGTR1(2) : S85 A(9,4) B(9,4) B(13) ZERO A(13) ;
ALESS1(3) UNUSED AGTR1(3) : S85 A(14,4) B(14,4) B(18) ZERO A(18) ;
ALESS1(4) UNUSED AGTR1(4) : S85 A(19,4) B(19,4) B(23) ZERO A(23) ;
ALESS1(5) UNUSED AGTR1(5) : S85 A(24,4) B(24,4) B(28) ZERO A(28) ;
ALESS1(6) ABEQ1 AGTR1(6) : S85 A(29,4) B(29,4) ZERO ONE ZERO ;
ALESS2 ABEQ2 AGTR2 : S85 AGTR1(2,4) ALESS1(2,4)
 ALESS1(6) ABEQ1 AGTR1(6) ;
AHIGH(1,4) : FORM A(1,3) AGTR1(1) ;
BHIGH(1,4) : FORM B(1,3) ALESS1(1) ;
ALESS ABEQ AGTR : S85 AHIGH(1,4) BHIGH(1,4) ALESS2 ABEQ2 AGTR2 ;
: OUTPUT A(1,32) B(1,32) ;
10 : OUTPUT ALESS ABEQ AGTR ;
20 : OUTPUT AHIGH(1,4) BHIGH(1,4) ;
20 : OUTPUT ALESS1(1,6) ABEQ1 AGTR1(1,6) ;
20 : OUTPUT ALESS2 ABEQ2 AGTR2 ;
$ THIS LOGIC PRODUCES THE ADDER FUNCTION FOR ADD AND SUBTRACT
$ THE PRIMARY MEANS FOR THIS IS THE SIG8205 ROM
JUNK(1,5) : FORM AZERO BZERO EXC2BAR CUADD CUSUB ;
JUNK(6,4) : FORM EXPA(1) EXPB(1) AGTR ABEQ ;
ADDADDR(1,9) : FORM JUNK(1,5) JUNK(6,4) ;
XX(1,4) : S02 ABS(1,4) ABS(4,4) ;
ABEXEQ : S20 XX(1) XX(2) XX(3) XX(4) ;
ADDCNTL(1,8) : SIG8205 ADDADDR(1,9) ;
AFUNC1(1,4) : S257 ADDCNTL(1,4) ADDCNTL(5,4) ABEXEQ ENABADD ;
AFUNC(1,3) : WOR AFUNC1(1,3) CUAFUNC(1,3) ;
10 : OUTPUT ADDADDR(1,9) ;
5 : OUTPUT ADDCNTL(1,8) ;
20 : OUTPUT ABEXEQ ;
10 : OUTPUT AFUNC1(1,3) ;
: OUTPUT CUAFUNC(1,3) CUADD CUSUB ;
SIGN : S157 EXPO(1) AFUNC1(4) NINH ZERO ;
: OUTPUT SIGN ;
$ THIS IS THE "A" ALIGNMENT SHIFTING LOGIC
: OUTPUT AINH ;
LEFT(1,8,4) : SIG8243 A(1,8,4) ASHIFT(1,3)
 ENASH ONE ONE ;
LEFT(2,8,4) : SIG8243 A(2,8,4) ASHIFT(1,3)
 ENASH ONE ONE ;
LEFT(3,8,4) : SIG8243 A(3,8,4) ASHIFT(1,3)
 ENASH ONE ONE ;
LEFT(4,8,4) : SIG8243 A(4,8,4) ASHIFT(1,3)
 ENASH ONE ONE ;
5 : OUTPUT LEFT(1,32) ASHIFT(1,3) ;
$ THIS IS THE "B" ALIGNMENT SHIFTING LOGIC
ARIGHT(1,8,4) : SIG8243 B(1,8,4) BSHIFT(1,3)
 ENBSH ONE ONE ;
ARIGHT(2,8,4) : SIG8243 B(2,8,4) BSHIFT(1,3)
 ENBSH ONE ONE ;
ARIGHT(3,8,4) : SIG8243 B(3,8,4) BSHIFT(1,3)
 ENBSH ONE ONE ;
ARIGHT(4,8,4) : SIG8243 B(4,8,4) BSHIFT(1,3)
 ENBSH ONE ONE ;
5 : OUTPUT ARIGHT(1,32) BSHIFT(1,3) ;
$ THIS IS THE LEFT SHIFT LOGIC USED IN NORMALIZATION
: OUTPUT NSHIFT(1,3) NINH ;
NSHIFT(1,3) : S157 NSH(1,3) ZEROS(1,3) ZFF ZERO ;
NSH(1,3) : TI148 BTEST(1,8) ;
NORM(1,8,4) : SIG8243L B(1,8,4) NSHIFT(1,3)
 NINH ONE ONE ;
NORM(2,8,4) : SIG8243L B(2,8,4) NSHIFT(1,3)
 NINH ONE ONE ;
NORM(3,8,4) : SIG8243L B(3,8,4) NSHIFT(1,3)
 NINH ONE ONE ;
NORM(4,8,4) : SIG8243L B(4,8,4) NSHIFT(1,3)
 NINH ONE ONE ;
10 : OUTPUT NORM(1,32) NSHIFT(1,3) ;
10 : OUTPUT NSH(1,3) ;
RIGHT(1,32) : WAND ARIGHT(1,32) NORM(1,32) ;
5 : OUTPUT RIGHT(1,32) ;
$ THIS IS THE PRIMARY EXPONENT ADDER
: OUTPUT EXPA(1,8) EXPB(1,8) ;
AEXP0(1,8) : FORM ZERO EXPA(2,7) ;
AEXSTR : S11 ZFF AEXSTRC ONE ;
AEXP(1,5) : S157 AEXP0(1,5) ZEROS(1,5) EX157 ZERO ;
AEXP(6,3) : S157 AEXP0(6,3) NSHIFT(1,3) EX157 AEXSTR ;
10 : OUTPUT AEXP0(1,8) ;
: OUTPUT EX157 ;
BEXP(1,8) : FORM ZERO EXPB(2,7) ;
XORSIGN : S86 EXPA(1) EXPB(1) ;
: OUTPUT EXCARRY ;
BAFUNC(1,3) : FORM ZEROS(1,2) ONE ;
ABG(2) ABP(2) : S381GP AEXP(5,4) BEXP(5,4) ABFUNC(1,3) ;
ABG(1) ABP(1) : S381GP AEXP(1,4) BEXP(1,4) ABFUNC(1,3) ;
BAG(2) BAP(2) : S381GP AEXP(5,4) BEXP(5,4) BAFUNC(1,3) ;
BAG(1) BAP(1) : S381GP AEXP(1,4) BEXP(1,4) BAFUNC(1,3) ;
EX1(1,4) : S381 AEXP(1,4) BEXP(1,4) ABFUNC(1,3) FABC4 ;
EX1(5,4) : S381 AEXP(5,4) BEXP(5,4) ABFUNC(1,3) EXCARRY ;
EXBA(1,4) : S381 AEXP(1,4) BEXP(1,4) BAFUNC(1,3) FBAC4 ;
EXBA(5,4) : S381 AEXP(5,4) BEXP(5,4) BAFUNC(1,3) ONE ;
^Uo 7 UNUSE D5^f^CrEXcI^^E5
,
I^!B
C
2
2
ONE F AG(U FBAPU.4, , U/'SS
UNUSED UNUSED FABC4 EXC2BAR UNUSED : S182 EXCARRY FABG(1,4) FABP(1,4) ;
FABG(1,4) : FORM ONES(1,2) ABG(1,2) ;
FABP(1,4) : FORM ONES(1,2) ABP(1,2) ;
FBAG(1,4) : FORM ONES(1,2) BAG(1,2) ;
FBAP(1,4) : FORM ONES(1,2) BAP(1,2) ;
5 : OUTPUT ABS(1,7) EX1(1,8) EXBA(1,8) ;
: OUTPUT ABFUNC(1,3) ;
5 : OUTPUT AEXP(1,8) BEXP(1,8) ;
10 : OUTPUT FABC4 EXC2BAR EXC2 ;
10 : OUTPUT FBAC4 ;
15 : OUTPUT ABG(1,2) ABP(1,2) ;
15 : OUTPUT BAG(1,2) BAP(1,2) ;
: OUTPUT EXCARRY ;
: OUTPUT XORSIGN ;
$ THIS IS THE 32-BIT FRACTION ADDER
ENABBAR : S04 ENABADD ;
AC : H52 ENABADD CUAC ENABBAR AFUNC1(2) AFUNC1(3) ;
: OUTPUT CUAC ;
AGH(1) APH(1) : S381GP LEFT(1,4) RIGHT(1,4) AFUNC(1,3) ;
AGH(2) APH(2) : S381GP LEFT(5,4) RIGHT(5,4) AFUNC(1,3) ;
AGH(3) APH(3) : S381GP LEFT(9,4) RIGHT(9,4) AFUNC(1,3) ;
AGH(4) APH(4) : S381GP LEFT(13,4) RIGHT(13,4) AFUNC(1,3) ;
AGL(1) APL(1) : S381GP LEFT(17,4) RIGHT(17,4) AFUNC(1,3) ;
AGL(2) APL(2) : S381GP LEFT(21,4) RIGHT(21,4) AFUNC(1,3) ;
AGL(3) APL(3) : S381GP LEFT(25,4) RIGHT(25,4) AFUNC(1,3) ;
AGL(4) APL(4) : S381GP LEFT(29,4) RIGHT(29,4) AFUNC(1,3) ;
AG2 AP2 AC4H AC8H AC12H : S182 AC16 AGH(1,4) APH(1,4) ;
AG1 AP1 AC4L AC8L AC12L : S182 AC AGL(1,4) APL(1,4) ;
AC16 : S182X AC AG1 AP1 ;
ACOUT : S182X AC16 AG2 AP2 ;
SUM(1,4) : S381 LEFT(1,4) RIGHT(1,4) AFUNC(1,3) AC12H ;
SUM(5,4) : S381 LEFT(5,4) RIGHT(5,4) AFUNC(1,3) AC8H ;
SUM(9,4) : S381 LEFT(9,4) RIGHT(9,4) AFUNC(1,3) AC4H ;
SUM(13,4) : S381 LEFT(13,4) RIGHT(13,4) AFUNC(1,3) AC16 ;
SUM(17,4) : S381 LEFT(17,4) RIGHT(17,4) AFUNC(1,3) AC12L ;
SUM(21,4) : S381 LEFT(21,4) RIGHT(21,4) AFUNC(1,3) AC8L ;
SUM(25,4) : S381 LEFT(25,4) RIGHT(25,4) AFUNC(1,3) AC4L ;
SUM(29,4) : S381 LEFT(29,4) RIGHT(29,4) AFUNC(1,3) AC ;
: OUTPUT AC ;
5 : OUTPUT LEFT(1,32) RIGHT(1,32) AFUNC(1,3) ;
15 : OUTPUT AGH(1,4) APH(1,4) ;
15 : OUTPUT AGL(1,4) APL(1,4) ;
5 : OUTPUT SUM(1,32) ;
15 : OUTPUT AC4H AC8H AC12H ;
15 : OUTPUT AC4L AC8L AC12L ;
5 : OUTPUT ACOUT ;
15 : OUTPUT AC16 AG1 AG2 AP1 AP2 ;
$ THIS LOGIC RESPONDS TO FRACTION ADDITION OVERFLOWS
$ IT SHIFTS THE FRACTION ONE DIGIT TO THE RIGHT ON OVERFLOW
OVFL(1,32) : FORM ONES(1,3) ZERO SUM(1,28) ;
FRACT(1,32) : S158 OVFL(1,32) SUM(1,32) OVFLSEL ZERO ;
OVFLCON(1,8) : FORM ONES(1,3) ACOUT ONES(1,4) ;
OVFLSEL : S151 OVFLCON(1,8) AFUNC1(1,3) ;
5 : OUTPUT OVFLSEL ;
5 : OUTPUT OVFL(1,32) ;
: OUTPUT FRACT(1,32) ;
$ THIS IS THE EXPONENT CORRECTION ADDER, WHICH INCREMENTS
$ THE EXPONENT BY ONE WHEN A FRACTION OVERFLOW OCCURS
EXSELIN(1,8) : FORM ONE ZERO ONES(1,2) ZEROS(1,2) ONE ZERO ;
EXCNTRL(1,3) : FORM EXC2 AZERO BZERO ;
EXSEL : S151 EXSELIN(1,8) EXCNTRL(1,3) ;
10 : OUTPUT EXSEL ;
100 : OUTPUT EXCNTRL(1,3) EXSELIN(1,8) ;
EX3T01L : S20 EXSEL EXP1 ONE ONE ;
EX3T01C(1,2) : FORM EX3T01H EX3T01L ;
EX3T01(1,8) : SIG8263 EXlll.8) AEXP0(1,3)
    EX3T01C(1,2) ZERO ;
20 : OUTPUT EX3T01(1,8) EXPSUM(1,8) ;
: OUTPUT EX3T01H EXP1 ;
10 : OUTPUT EX3T01C(1,2 8EXP(1,8)
20 : OUTPUT EX3T01L ;
EXPSUM(1,4) UNUSED : S181 EX3T01(1,4) ZEROS(1,4) CORRCRY
    ZEROS(1,4) ZERO ;
EXPSUM(5,4) CORRCRY : S181 EX3T01(5,4) ZEROS(1,4) ZERO
    ZEROS(1,4) ZERO ;
EXP(1,8) : S157 EXPSUM(1,8) EX3T01(1,8) OVFLSEL ZERO ;
30 : OUTPUT CORRCRY ;
: OUTPUT EXP(2,7) ;
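The overflow-response and exponent-correction listings above describe one combined step: when a fraction addition overflows, the fraction moves one digit to the right and the exponent is incremented by one. The following Python sketch illustrates that behavior only. It is not a translation of the logic description: the function name `correct_overflow`, the use of positive logic, the treatment of the carry-out as a leading digit of value 1, and the choice to trigger on the carry-out alone (the listing selects the overflow condition through a multiplexer keyed by the adder function) are all assumptions made for illustration.

```python
FRACTION_BITS = 32  # width of FRACT(1,32) in the listing
DIGIT_BITS = 4      # assumed digit width, inferred from SUM(1,28) in the listing


def correct_overflow(fraction_sum, carry_out, exponent):
    """Hypothetical helper: return (fraction, exponent) after overflow correction.

    fraction_sum is the raw adder output, carry_out is the adder carry,
    and exponent is the uncorrected exponent.
    """
    mask = (1 << FRACTION_BITS) - 1
    fraction_sum &= mask
    if not carry_out:
        # No overflow: fraction and exponent pass through unchanged.
        return fraction_sum, exponent
    # Overflow: shift the fraction right by one digit ...
    shifted = fraction_sum >> DIGIT_BITS
    # ... place the carry in the vacated leading digit (assumption) ...
    shifted |= 1 << (FRACTION_BITS - DIGIT_BITS)
    # ... and compensate by adding one to the exponent.
    return shifted, exponent + 1
```

For example, `correct_overflow(0x23456789, True, 0x48)` yields the fraction `0x12345678` with exponent `0x49`, while the same call with `carry_out=False` returns its inputs unchanged.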
$ THIS LOGIC TESTS THE RESULT FRACTION FOR ZERO, AND SETS
$ THE ZERO FLIP-FLOP ACCORDING TO THE RESULT OF THE TEST
ZFFBITS(1) : S260 FRACT(1,4) ;
ZFFBITS(2) : S260 FRACT(5,4) ;
ZFFBITS(3) : S260 FRACT(9,4) ;
ZFFBITS(4) : S260 FRACT(13,4) ;
ZFFBITS(5) : S260 FRACT(17,4) ;
ZFFBITS(6) : S260 FRACT(21,4) ;
ZFFBITS(7) : S260 FRACT(25,4) ;
ZFFBITS(8) : S260 FRACT(29,4) ;
ZFFINBAR : S133 ZFFBITS(1,8) ;
ZFFIN : S04 ZFFINBAR ;
*ZFF *ZFFBAR : S74 ZFFIN CLOCK ;
10 : OUTPUT ZFFIN ;
10 : OUTPUT ZFFINBAR ;
20 : OUTPUT ZFFBITS(1,8) ;
: OUTPUT ZFF ZFFBAR CLOCK ZFFIN ;
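The zero test above is a two-level gate tree: one S260 NOR gate per 4-bit group of the fraction, an S133 NAND across the eight group results, and an S04 inverter feeding the S74 flip-flop. A minimal Python sketch of the combinational part follows. The function names are hypothetical, the fraction is modeled as a list of 32 bits with FRACT(1) first, positive logic is assumed, and unused gate inputs are assumed to be tied to their inactive level.

```python
def nor(bits):
    """NOR gate: 1 only when every input is 0 (models an S260 gate)."""
    return 0 if any(bits) else 1


def nand(bits):
    """NAND gate: 0 only when every input is 1 (models the S133 gate)."""
    return 0 if all(bits) else 1


def zero_flag_input(fract):
    """Return ZFFIN for a 32-bit fraction given as a list of bits.

    ZFFIN is 1 exactly when all 32 fraction bits are 0.
    """
    assert len(fract) == 32
    # ZFFBITS(1..8): one NOR per 4-bit group, FRACT(1,4) through FRACT(29,4).
    zffbits = [nor(fract[i:i + 4]) for i in range(0, 32, 4)]
    # ZFFINBAR: NAND of the eight group results.
    zffinbar = nand(zffbits)
    # ZFFIN: inverter on ZFFINBAR; this is what the flip-flop samples on CLOCK.
    return 1 - zffinbar
```

Calling `zero_flag_input([0] * 32)` returns 1, and setting any single bit of the list to 1 makes it return 0.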
//DECK EXEC ASSEMBLY,PARM='FX,ESD,LSETC=12',REGION=180K
//SYSLIB DD DSN=USER.P4293.SUPPORT,DISP=SHR
// DD DSN=USER.P4293.PACKAGES,DISP=SHR
// DD DSN=SYS1.MACLIB,DISP=SHR
//SYSIN DD DSN=&DECKFOG,DISP=(OLD,DELETE)
//LINKDECK EXEC LINKEDIT,PARM='LIST,MAP,NCAL,LET',REGION=102K,
// LOADSET='USER.P4293.LINKOUT(LOGFOG)'
//SYSLIB DD DSN=USER.P4293.LINKOUT,DISP=SHR
//SYSLMOD DD DISP=OLD,SPACE=(TRK,(10,3,10))
//MICRO EXEC ASSEMBLY,PARM='NOXREF,NOLREF,ESD',REGION=180K
//SYSLIB DD DSN=USER.P4293.SPECIAL,DISP=SHR
// DD DSN=USER.P4293.MICRO,DISP=SHR
// DD DSN=USER.P4293.SUPPORT,DISP=SHR
// DD DSN=USER.P4293.PACKAGES,DISP=SHR
// DD DSN=SYS1.MACLIB,DISP=SHR
//SYSIN DD DSN=&MICFOG,DISP=(OLD,DELETE)
// DD *
PRINT NOGEN
STEP AEXSTRC=0,CUAC=0,EX3T01H=1,CLOCK=1,AINH=1,BINH=1
STEP SHZERO=1,NINH=1,CUAFUNC=000,CUADD=1,CUSUB=0,EX157=0
STEP ENABADD=0,EXP1=1,ABFUNC=010,EXCARRY=1
STEP EXPA=01001000,EXPB=01001000
STEP A=X0,B=X0
STEP AEXSTRC=1,CUAC=1,EX3T01H=0,CLOCK=0,NINH=0
STEP CUAFUNC=011,ENABADD=1,B=X0,EXP1=0,ABFUNC=011
STEP EXPB=*EXP,AINH=0,BINH=0,EX157=1,EXCARRY=0
STEP AEXSTRC=0,CUAC=0,EX3T01H=1,CLOCK=1,AINH=1,BINH=1
STEP SHZERO=1,NINH=1,CUAFUNC=000,CUADD=1,CUSUB=0,EX157=0
STEP ENABADD=0,EXP1=1,ABFUNC=010,EXCARRY=1
STEP CUADD=0
RUN
STEP CUSUB=1
RUN
STEP A=X80000000
RUN
STEP B=X50000000
RUN
STEP EXPA=01000010
RUN
STEP EXPA=01001000
STEP CUSUB=0
RUN
STEP B=X80000058
STEP AEXSTRC=1,CUAC=1,EX3T01H=0,CLOCK=0,NINH=0
STEP CUAFUNC=011,ENABADD=1,B=X58,EXP1=0,ABFUNC=011
STEP EXPB=*EXP,AINH=0,BINH=0,EX157=1,EXCARRY=0
STEP AEXSTRC=0,CUAC=0,EX3T01H=1,CLOCK=1,AINH=1,BINH=1
STEP SHZERO=1,NINH=1,CUAFUNC=000,CUADD=1,CUSUB=0,EX157=0
STEP ENABADD=0,EXP1=1,ABFUNC=010,EXCARRY=1
STEP EXPB=010001U
RUN
STEP CUSUB=1
RUN
STEP CUADD=1
RUN
STEP EXPB=01001000
RUN
STOP
END
//LINKMIC EXEC LINKEDIT,PARM='LIST,MAP,LET',REGION=102K,
// LOADSET='USER.P4293.LINKOUT(MICFOG)'
//SYSLIB DD DSN=USER.P4293.LINKOUT,DISP=SHR
//SYSLMOD DD DISP=OLD,SPACE=(TRK,(10,3,10))
//LINKSIM EXEC LINKEDIT,PARM='LIST,MAP,LET',REGION=102K,
// LOADSET='USER.P4293.LINKOUT(TESTFOG)'
//SYSLIB DD DSN=USER.P4293.LINKOUT,DISP=SHR
//SYSLIN DD *
 ENTRY PROGRAM
 INCLUDE SYSLIB(MICFOG,LOGFOG)
//SYSLMOD DD DISP=OLD,SPACE=(TRK,(10,3,10))
//RUN EXEC PGM=TESTFOG,REGION=32K,TIME=(,10),PARM='255'
//SYSPRINT DD SYSOUT=A
//SYSUDUMP DD SYSOUT=A
//RUN EXEC PGM=TESTFOG,REGION=32K,TIME=(,10),PARM='0'
//SYSPRINT DD SYSOUT=A
//SYSUDUMP DD SYSOUT=A
BIBLIOGRAPHIC DATA SHEET
1. Report No.: UIUCDCS-R-75-761
3. Recipient's Accession No.
4. Title and Subtitle:
AN ARRAY COMPUTER FOR THE CLASS OF PROBLEMS TYPIFIED
BY THE GENERAL CIRCULATION MODEL OF THE ATMOSPHERE
5. Report Date: December 1975
7. Author(s): MARVIN L. GRAHAM and D. L. SLOTNICK
8. Performing Organization Rept. No.: UIUCDCS-R-75-761
9. Performing Organization Name and Address:
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801
10. Project/Task/Work Unit No.
11. Contract/Grant No.: US NASA NAS-5-23334
12. Sponsoring Organization Name and Address:
NASA Goddard Space Flight Center
2880 Broadway
New York, NY 10025
13. Type of Report & Period Covered: Doctoral and Final Report 1975
14.
15. Supplementary Notes
16. Abstracts
The goal of this research was the design of a computer suited to the class of
problems typified by the general circulation model of the atmosphere. The needs that prompted the research imposed several constraints on the design which was sought. Primary among these was that the new machine have roughly 100 times the computing
capability of the IBM 360/95 which is now used in general circulation model research. Of equal importance, the cost of the machine was to be significantly less than that
of extant machines with similar computing capability.
The design which is presented is that of an array processor similar in
architecture to ILLIAC IV. The design is in terms of commercially available TTL and ECL small and medium scale integrated circuits. We believe that it achieves the cost and performance goals set for it.
17. Key Words and Document Analysis. 17a. Descriptors
array computer, parallel computer, general circulation model, switching network, Clos network, omega network, logic simulation, simulation of computer logic
17b. Identifiers/Open-Ended Terms
17c. COSATI Field/Group
18. Availability Statement: Unlimited
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages: 298
22. Price

