The architecture of tomorrow's massively parallel computer by Batcher, Ken
N87-26548
The Architecture of Tomorrow's
Massively Parallel Computer
Transcribed from an after-dinner talk given
by Dr. Ken Batcher
Goodyear Aerospace Corporation
on September 24, 1986
Goodyear Aerospace delivered the MPP to
NASA/Goddard in May 1983, over three
years ago. Ever since then we have tried to
look in a forward direction. There is always
some debate as to which way is forward
when it comes to supercomputer
architecture. In this talk, I will describe
improvements to the MPP's massively
parallel architecture in the areas of data I/O,
memory capacity, connectivity, and indirect
(or local) addressing.
I/O
Several years ago, Goodyear decided they
should advertise the fact that they are
something more than a tire company. They
started a series of ads. A particular ad
appeared in the Wall Street Journal a couple
years ago saying that our computer can add
and subtract 6 1/2 billion times a second
(that's on eight bit additions). Someone at
Goodyear thought up captions for the three
men in the ad. The first man says "How
long does it take to get the 6 1/2 billion pairs
of numbers into the computer." The second
man answers, "Oh, about a half an hour,
then you add and subtract them in 1
second." So the third guy says, "Well, I
hope they still make tires."
This points out the I/O problem of the MPP.
Actually, the problem is shared by most
supercomputers...they tend to be I/O
bound. At the conference today, the
speakers were saying that they were I/O
bound. Figure 1 shows the rates at which
data is transferred between various parts of
the machine. The processing is going on
between the PE registers and the ARU
memory at 20,480 megabytes per second
(assuming 16,384 wires pushing data at 100
nanoseconds per bit), so you can see the
magnitude of the processing rate. Data
slides in and out of the ARU memory from
the side over 128, rather than 16,384 wires,
so you get 160 megabytes per second, still a
fairly respectable speed.
[Figure 1: Data transfer rates between the parts of the MPP]
In most computers, if you turn on all the I/O, the processing
would grind to a halt. It would take all the
memory cycles. On the MPP you can run
the I/O full bore and slow down the
processing rate by 1.6%. The processing
rate doesn't see the I/O.
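The rates quoted above follow from the wire counts and the 100 nanosecond bit time. The sketch below redoes that arithmetic. The function and variable names are illustrative, a megabyte is taken as 10**6 bytes, and reading the 1.6% figure as input plus output running together is an assumption rather than something the talk states.

```python
# Back-of-the-envelope check of the transfer rates quoted in the talk.
# Names are illustrative; a megabyte is taken as 10**6 bytes (an assumption).

def rate_mb_per_sec(wires, ns_per_bit=100):
    """Megabytes/sec moved by `wires` parallel wires, one bit per wire per cycle."""
    bits_per_sec = wires * 1_000_000_000 // ns_per_bit
    return bits_per_sec / 8 / 1_000_000

pe_rate = rate_mb_per_sec(16_384)   # PE registers <-> ARU memory
io_rate = rate_mb_per_sec(128)      # ARU memory <-> staging memory, one direction

# Assumption: "full bore" means input and output running at the same time,
# so I/O takes 2 * 128 out of every 16,384 bit transfers.
slowdown_percent = 100 * 2 * io_rate / pe_rate

print(pe_rate, io_rate, round(slowdown_percent, 1))
```

Under these assumptions the three numbers come out as 20,480 megabytes per second, 160 megabytes per second, and about 1.6 percent.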
The I/O situation gets worse once we've put
the data into the staging memory. Now we
want to move it to the VAX computer
through a DR780 channel, which is the
fastest way you can get data in and out of a
VAX but that's only at 6-8 megabytes per
second. This is where the half hour figure
comes from. You've got to put 6 1/2 billion
pairs of numbers through this DR780
channel, which takes about half an hour,
and the PEs would add them up in about 1
second. It's basically a limitation of the
VAX. And it gets even worse when you
look at the fastest way to get data in
and out of the VAX: unless you change the
disk packs or something, you're limited to a
tape drive that records at 6250 bits per
inch, and 6250 bpi tape at 120 inches per
second gives you 0.75 megabytes per
second. The significant difference between
this .75 megabytes per second and the 160
megabytes per second rate out of the stager
is basically your I/O problem.
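The half hour figure and the tape rate can be checked the same way. This is a rough sketch with illustrative names. It assumes each 8-bit operand occupies one byte, a megabyte is 10**6 bytes, and the 6250 bpi tape stores one byte per position along its length.

```python
# Rough check of the "half an hour" figure and the tape rate.
# Assumptions: one byte per 8-bit operand, a megabyte is 10**6 bytes,
# and 6250 bpi tape holds one byte per position along the tape.

PAIRS = 6_500_000_000
BYTES_TO_LOAD = PAIRS * 2            # two 8-bit operands per pair

def minutes_to_load(mb_per_sec):
    """Minutes needed to move all the operands at the given rate."""
    return BYTES_TO_LOAD / (mb_per_sec * 1_000_000) / 60

dr780_minutes = (minutes_to_load(8), minutes_to_load(6))   # best case, worst case
tape_mb_per_sec = 6250 * 120 / 1_000_000                   # bytes/inch * inches/sec
tape_hours = minutes_to_load(tape_mb_per_sec) / 60
stager_to_tape_ratio = 160 / tape_mb_per_sec

print(dr780_minutes, tape_mb_per_sec, tape_hours, stager_to_tape_ratio)
```

At 6 to 8 megabytes per second the load takes roughly 27 to 36 minutes, which is where "about a half an hour" comes from. Through the tape drive alone it would take several hours.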
You've got several ways of solving the
problem, or at least making it less
noticeable. We did design a disk farm for
the MPP that would move data in and out of
the staging memory at the staging memory
rate of 160 megabytes per second. That's
one possibility. The other possibility is
some kind of high-speed network. If you
had a device that was generating data at
some fast rate, you could hook it directly
into the staging memory at up to 160
megabytes per second. You want to bypass
the VAX completely.
Another possibility is to increase the 160
megabyte per second rate to and from the
array. Figure 2 shows the ARU as it exists
right now. It has 128 columns, plus four
spare columns. The data comes in from the
left, goes out to the right, and is 128 wide at
160 megabytes per second input and output.
If you aren't satisfied with that rate, you
could divide the array up into four slices,
put data into each slice simultaneously, and
take it out of each slice simultaneously.
This would give you four times the I/O rate,
or 640 megabytes per second. If you want
to preserve the redundancy feature of the
array unit, you could add four spare
columns to each slice of 32, so your array
unit would have 144 columns in it, instead
of 132.
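The slicing arithmetic can be written out as a small sketch. The function name and its parameters are illustrative. It assumes every slice keeps its own 128-wide port running at 160 megabytes per second and gets the same number of spare columns.

```python
# Column count and I/O rate when the array is cut into equal slices.
# Assumption: every slice has its own 160 megabyte/sec port and its own spares.

def sliced_array(slices, columns=128, spares_per_slice=4, mb_per_sec_per_port=160):
    columns_per_slice = columns // slices
    return {
        "columns_per_slice": columns_per_slice,
        "total_columns": slices * (columns_per_slice + spares_per_slice),
        "io_mb_per_sec": slices * mb_per_sec_per_port,
    }

current = sliced_array(1)    # the array as it exists now
proposed = sliced_array(4)   # four slices of 32 columns each

print(current)
print(proposed)
```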
[Figure 2: The ARU as it exists now, with input from the left and output to the right, each 128 wide at 160 megabytes/sec]
Memory Capacity
There seems to be a Parkinson's law of
computer memory that says no matter how
large you build the computer memory,
there's always a computer problem that will
overflow it. That's true of any computer,
whether it's a personal microcomputer or a
big supercomputer. The MPP is no
different. Some of the speakers today
talked about how they could use more
memory...so there is a memory problem.
In the original specification back in 1979,
NASA wanted 256 bits per PE for a total of
half a megabyte of ARU memory (see Table
3). We figured that was too small, so what
we delivered was 1,024 bits, for a total of 2
megabytes of ARU memory. At that time
we could get 4x1K static RAM chips with
an access time of about 50 nanoseconds,
and those are what we used to implement
the MPP's ARU memory. We know that
memory technology is always growing, so
when we designed the machine we put in
16-bit addresses, so we could increase the
ARU memory size later when the memory
technology improved. Today, what we
would do is build a board with memory
sockets that could accept either 16K or 64K
memory chips. Right now the 4x16K static
RAM chips are readily available. Actually,
the 4x64K RAMs are also available.
They've got a high price tag but in a few
years that should drop down. Today we
could supply memory boards with either
16K or 64K bits per PE. This would
increase the memory to either 32 or 128
megabytes. So you can get either 16 times
or 64 times your present capacity---which is
2 megabytes. It's therefore real easy to
expand ARU memory in the machine.
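The ARU memory sizes above are the bits per PE multiplied by the 16,384 PEs. A minimal sketch of that arithmetic, with illustrative names and a megabyte taken as 2**20 bytes (which is what makes the quoted figures come out round):

```python
# ARU memory capacity from bits per PE.
# Names are illustrative; a megabyte is taken as 2**20 bytes here.

PES = 128 * 128   # 16,384 processing elements

def aru_megabytes(bits_per_pe):
    return bits_per_pe * PES / 8 / 2**20

original_spec = aru_megabytes(256)         # the 1979 specification
delivered = aru_megabytes(1024)            # as delivered in 1983
with_16k = aru_megabytes(16 * 1024)        # 16K bits per PE
with_64k = aru_megabytes(64 * 1024)        # 64K bits per PE
addressable_bits_per_pe = 2 ** 16          # limit set by the 16-bit address

print(original_spec, delivered, with_16k, with_64k, addressable_bits_per_pe)
```

The 64K option is exactly what a 16-bit address can reach, so it is the largest expansion that needs no change to the addressing.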
ARU memory
                       Bits per PE    Total (megabytes)
Original spec (1979)   256            0.5
Delivered in 1983      1,024          2
Possible today         16K or 64K     32 or 128
Table 3
In 1983, we delivered four banks of staging
memory using 64K dynamic RAM chips, so
the staging memory had a capacity of 2
megabytes and an I/O rate of 20 megabytes
per second (see Table 4). We put 32 slots in
the cabinet so that it could be expanded to
32 banks. This year we did expand it up to
16 banks. At the same time, we changed it
to larger chips---256K chips. So right now
the staging memory is 16 times larger than it
was originally. The speed is 4 times
greater. We still have half the slots available
and at some time in the future we could
expand it up to twice as big and twice as fast
by populating all 32 slots. You can even go
further than that; 1,024K (1 megabit)
memory chips. These are starting to
become available. So we could change the
32 boards and put in chips that are four
times bigger and make the memory 256
megabytes. If you do that and you want the
faster I/O in the ARU you've also got to
feed it. You've got to take the staging
memory and make it 128 banks...this will
get the 640 megabytes per second. That
will also give you a gigabyte of memory for
the stager.
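The staging memory figures scale in a simple way. This sketch assumes capacity grows with the number of banks times the chip size and the I/O rate grows with the number of banks, starting from the 1983 configuration. The names are illustrative, and the megabytes are the round figures quoted in the prose.

```python
# Staging memory scaling, starting from the 1983 configuration.
# Assumptions: capacity scales with banks * chip size, I/O rate with banks.

BASE_BANKS, BASE_CHIP_KBITS = 4, 64        # as delivered in 1983
BASE_MEGABYTES, BASE_MB_PER_SEC = 2, 20

def stager(banks, chip_kbits):
    """Return (capacity in megabytes, I/O rate in megabytes/sec)."""
    capacity = BASE_MEGABYTES * banks * chip_kbits // (BASE_BANKS * BASE_CHIP_KBITS)
    rate = BASE_MB_PER_SEC * banks // BASE_BANKS
    return capacity, rate

configurations = {
    "4 banks, 64K chips": stager(4, 64),
    "16 banks, 256K chips": stager(16, 256),
    "32 banks, 256K chips": stager(32, 256),
    "32 banks, 1024K chips": stager(32, 1024),
    "128 banks, 1024K chips": stager(128, 1024),
}

for name, (megabytes, mb_per_sec) in configurations.items():
    print(f"{name}: {megabytes} megabytes, {mb_per_sec} megabytes/sec")
```

Only the last configuration reaches the 640 megabytes per second needed to feed a four-slice array.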
Staging memory
              Banks   Chips    Capacity       I/O rate
                               (megabytes)    (megabytes/sec)
1983            4     64K          2             20
1986           16     256K        33             80
Future         32     256K        67            160
Future         32     1024K      268            160
Faster I/O    128     1024K     1075            640
Table 4
Connectivity
It is real easy to change the architecture of
the machine for faster I/O and more memory
and still be compatible with the current
MPP. Thus, the modifications I have
described so far are still upwardly
compatible with the current MPP.
Programs wouldn't have to be changed to
use the larger memory capacity and faster
I/O. On the other hand, modifications to
connectivity and addressing could reduce
upward compatibility. Someone defined
upward compatibility as meaning that you
get to keep all the old mistakes. So if you
want to forget about being compatible and
look at other changes in the machine, then
we can talk about connectivity and indirect
addressing.
Figure 5 is a picture of the current ARU. It
has 16,384 processors and they
communicate with each other in the north,
south, east, and west directions over a
2-dimensional mesh. This is good for those
problems in which communication must be
nearest neighbor over a 2-D mesh. With the
topology on the outside it can be changed
into a 1-D mesh. And for 3-D problems
you can always run a third dimension down
the random access memory, especially if
you make that memory larger...to say
65,536 bits. So, you can treat the 1-D,
2-D, and 3-D problems without too much
trouble. However, there are a lot of
problems that require other connectivities.
We did add the staging memory, which in
some respects helps because if you don't
like the mesh connectivity, you can always
move the data out to the staging memory,
rearrange it and bring it back into the ARU
so that items that used to be far apart are
now close together. So this staging
memory does help somewhat in giving you
more connectivity than the 2-D mesh.
[Figure 5: The current ARU]
Looking back at the history of Goodyear
Aerospace (see Table 6), some interesting
trends in connectivity arise. Back in the mid
1960's we did a study contract for Griffiss
Air Force Base in Rome, New York, called
The Advanced Computer Organization
Study. We were looking at parallel
processors that have 100 PEs in them. We
figured that we wanted to hook anything to
anything else and the only way we knew
how to do this was with a sorting
network, which in some respects is kind of
an ideal: it will do any permutation by
sorting.
Period        Machine                                 Number of PEs   Connectivity
Mid 60's      Advanced Computer Organization Study    100             Sorting network
Early 70's    STARAN                                  512-1024        256-wide flip
Early 80's    ASPRO                                   about 1,800     32-wide flip
Table 6
In the early 1970's we built some
STARANs that consist of about 1,000 PEs
and we connected them together with a 256
wide flip network, which is basically the
same as an omega network or a butterfly
network...it has the same topology. So the
STARAN had less connectivity and more
PEs than we were looking at in the mid
'60's. Later, we built the ASPRO computer
with about 1,800 PEs and a narrower flip
network at 32 wide, that had less
connectivity than the STARAN. We then
delivered the MPP with 16,384 PEs and a
2-D mesh. When you look at this over
time, we have been increasing the number
of PEs and reducing the connectivity.
One way to explain the trend is to say
"Well, back in the '60's we were young and
more visionary than we are now. But now
we are more practical and have more
practical connectivities." I think the real
trend is that each of these projects was in
response to an RFP, and basically, we gave
the customer what he wanted. Back in the
mid '60's Rome was being more visionary,
and now NASA is being more practical and
asked for a 2-D mesh. This is an illustration
of the Golden rule, "Whoever has the gold
makes the rules." Whatever the customer
wants is what we give them. That's what I
think is the real explanation of this.
Over the years, we have been looking at
connectivity networks other than the mesh.
If you had your druthers, you would like to
hook the 16,384 processors with a full
cross bar (see Figure 7) so that any
processor could talk to any number of its
neighbors and could broadcast to any
number of its neighbors. Unfortunately,
that requires something like a quarter of a
billion cross points, so it's rather
impractical.
[Figure 7: Ideal connectivity, a full crossbar with 268,435,456 cross points]
You can go various steps toward the ideal of
full connectivity (see Table 8) using two
approaches. You can take something like a
flip network or an omega network...several
people have talked about this network and
they all give it different names. I call it a
flip network, it has synonyms like omega
network, butterfly network, delta network,
etc. They all have the same basic network
topology, it's just different names.
INTERCONNECTION NETWORKS
  • FLIP NETWORK
      (N/2)(LOG2 N) SWITCHES
      114,688 SWITCHES FOR 16,384 ITEMS
  • BITONIC SORT NETWORK
      SELF-CONTROLLING
      (N/4)(LOG2 N)(1 + LOG2 N) SWITCHES
      860,160 SWITCHES FOR 16,384 ITEMS
Table 8
To connect n items together it takes (n/2)(log2 n)
switches. If you look at 16,384 processors,
then it takes 114,688 switches, which is
considerably less than the quarter billion
needed for ideal connectivity. This does
most of the useful permutations. For most
problems this is what you want. If you
want to be able to do any permutation of
16,384 items, then you would use a
network that is about twice as big, requiring
221,184 switches. Unfortunately, it takes
a while to compute how to set up such a
network. If you want to go one step further
you use the bitonic sort method, which in
some respects is self controlling. It will
compute the setting while the data is being
fed into it. And it takes (n/4)(log2 n)(1 + log2 n)
switches, which is 860,160 switches. So
these are three ways of connecting the PEs,
depending on what you want communicated
between the processors.
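The switch counts quoted for 16,384 items can be reproduced, and the self-controlling behavior of the bitonic sort network can be seen on a small case. The sketch below is illustrative rather than a description of any Goodyear hardware. All function names are made up, and the formula in `any_permutation_switches` is inferred: the talk gives only the figure 221,184, which matches (n/2)(2 log2 n - 1), the size of a Benes-type network. The router sorts packets by destination address, so every compare-exchange switch sets itself from the data passing through it.

```python
# Switch counts for n items, plus a small bitonic sorting network used as a router.
# Function names are illustrative; any_permutation_switches() is an inferred formula.

def log2_exact(n):
    """log2 of a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return n.bit_length() - 1

def crossbar_points(n):
    return n * n

def flip_switches(n):
    return n // 2 * log2_exact(n)

def any_permutation_switches(n):
    # Inferred: (2*log2(n) - 1) stages of n/2 switches reproduces 221,184.
    return n // 2 * (2 * log2_exact(n) - 1)

def bitonic_switches(n):
    k = log2_exact(n)
    return n * k * (k + 1) // 4

def bitonic_route(packets):
    """Sort (destination, payload) packets; return (sorted packets, switches used)."""
    a = list(packets)
    n = len(a)
    log2_exact(n)                # checks that n is a power of two
    switches = 0
    k = 2
    while k <= n:                # merge bitonic runs of length k
        j = k // 2
        while j >= 1:            # one stage of n/2 compare-exchange switches
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    switches += 1
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, switches

# Route a permutation of 16 items: PE `src` sends to PE (5*src + 3) mod 16.
packets = [((5 * src + 3) % 16, f"from PE {src}") for src in range(16)]
routed, used = bitonic_route(packets)
```

After routing, position d of `routed` holds the packet addressed to PE d, and `used` equals `bitonic_switches(16)`. No separate step was needed to compute switch settings, which is the sense in which the network controls itself.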
The hypercube is basically somewhat like
the flip network (or omega or butterfly), it
has the same kind of complexity. These
methods are the basic ways of improving
the connectivity of the MPP, but then you
lose something in compatibility. Currently
your programs communicate north, east,
west, and south, so you would have to do
some work to use these other connection
schemes.
Indirect (Local) Addressing
Figure 9 shows what one MPP PE looks
like. The random access memory has a
global address that comes from the control
unit. In fact, if you look at how we
implemented the PE, all the logic (except the
memory) is on one chip, the PE chip, which
doesn't even see the memory address. The
only connections between the memory chip
and the PE chip are the data paths. The
global address goes to the memory chips
directly and doesn't know that there is a PE
chip at all. So the memory address is a
global address. This means that if one PE
wants to look at bit 43 in its memory, then
all PEs must look at bit 43 in their
memories. This means that when we
program the machine, we look at it as a
bunch of memory planes and processing
planes. For example, you could take
memory plane number 43 and move it from
the memory into one of the processing
planes or store a processing plane into a
memory plane. All the data in one plane
moves en masse in one cycle. So you look
at the machine as a bunch of memory planes
and processing planes. If you look at one
bit of a plane, then you're looking at all bits
of the plane.
[Figure 9: Functional units of one PE]
In all the textbooks the MPP and the
Connection Machine are called SIMD
machines, but I question whether they are
really SIMD machines. In the classic SIMD
machine all the opcodes and addresses come
from a common control unit (see Table 10),
and this is true of the Connection Machine,
the MPP, and any bit-serial parallel
processor. Though all the textbooks call
this SIMD, I can argue that it is really SISD,
Single Instruction Single Data path. If you
look at a conventional computer, you have a
processor and memory and you take
memory words from the memory into the
processor and you store memory words.
The only difference is that our memory
words are 16,384 bits each. So you look at
this MPP (see Figure 11), or the Connection
Machine, as a conventional computer that
happens to have a very large memory word,
in this case 16,384 bits per memory word,
and all we are really doing is moving
memory words back and forth between
processors and memory. So you can argue
that the MPP and these other bit-serial
parallel machines are really SISD machines.
[Figure 11: The array unit, with north and south connections, input from the staging memory, output to the staging memory, and 4 spare columns]
Classic SIMD - opcodes and addresses from common
               control unit (really SISD).
MIMD - opcodes and addresses from local PE
       memories and registers.
SIMD - opcodes from control unit.
       Addresses from local PE memories and
       registers.
Table 10
If you look at what a MIMD machine is, the
opcodes and addresses come from local PE
memories and registers. Each processor
generates its own addresses and opcodes
from programs stored in its own memory,
as opposed to the other case where they all
come from a common control unit. So I can
argue that a true SIMD machine would have
the opcodes come from the control unit and
still have a single instruction stream.
However, if you really want multiple data,
you should really have the addresses come
from local memories and registers. This is
somewhat in between the classic SIMD and
MIMD machines. So, I think the true SIMD
machine is one in which the addresses are
generated locally, and the opcodes are
generated in the control unit.
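The difference between the two kinds of addressing can be shown on a toy model. This is only a sketch: it uses 4 PEs with 4 bits each instead of 16,384 PEs, and every name in it is illustrative. With a global address every PE reads the same bit position, so the result is one plane that can be viewed as a single wide memory word. With local addresses each PE picks its own bit.

```python
# Toy model: 4 PEs, each with a 4-bit local memory (the MPP has 16,384 PEs).
# memory[pe][address] is one bit; all names here are illustrative.
memory = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]

def read_plane(memory, address):
    """Global addressing: the control unit sends one address to every PE."""
    return [pe_memory[address] for pe_memory in memory]

def plane_as_word(plane):
    """View a bit plane as one wide memory word (PE 0 is the low-order bit)."""
    return sum(bit << pe for pe, bit in enumerate(plane))

def read_local(memory, addresses):
    """Local addressing: each PE supplies its own address."""
    return [pe_memory[a] for pe_memory, a in zip(memory, addresses)]

plane_2 = read_plane(memory, 2)             # every PE looks at its bit 2
word_2 = plane_as_word(plane_2)             # the same plane as a single word
mixed = read_local(memory, [0, 1, 2, 3])    # PE i looks at its own bit i

print(plane_2, word_2, mixed)
```

`read_plane` is the "one instruction, one very wide word" view behind the SISD argument. `read_local` is the behavior the talk asks for in a true SIMD machine.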
We found many problems where we really
would like to address the memory via a local
independent address rather than just through
a global address. One such case comes up
when trying to factor large numbers on the
MPP. If we are trying to factor a particular
large number, then what we do is look at
factors of 1,000 to 4,000 quadratic
residues. Table 12 shows an example of
4,000 quadratic residues. What we do is
find the prime factors of each of these. We
have maybe 4,000 primes in the horizontal
dimension, and first we find what primes
factor these residues, i.e., we try to divide
them all by 2, then by 3, then by 5 and so
on. We set up a flag matrix that shows
where the division is exact.
[Table 12: Flag matrix with the quadratic residues down the side and the primes across the top; a 1 marks an exact division, and most entries are 0]
However, we are not done then. We want to take out,
say, all the factors of 2, so if the number is
128, then we have to take out seven factors
of 2 from that number. If it is 243, then we
have to take out five factors of three. One
way of doing this is to keep dividing by two
until they are all odd, then divide by three as
much as possible, and it can be seen that
this is a sparse operation, i.e., most of the
PEs are not participating. Each one of these
divisions is in a separate PE, and most of
them are not participating. We keep
dividing out the higher powers of these
primes. What we would like to do to
improve the situation is to pack the data
together so that we get the kind of
arrangement shown in Table 13. We now
start with let's say 15 columns for the prime
factors. And then we divide these numbers
by these numbers, do that until all the
powers have been reduced and then we go
on to the next column. We would rather do
something 15 times than 4,000 times, and
we would like to rearrange the data like that,
pack it together where it is sparse. This
improves the utilization of the machine.
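A small sequential sketch may make the flag matrix and the packing step concrete. The residues and the list of primes below are made up for illustration, and all function names are hypothetical. On the MPP the packing would be done by moving data between PEs, which this sketch does not model.

```python
# Flag matrix, packing, and dividing out prime powers on made-up data.
# This is sequential illustration only; it does not model the PE array.

PRIMES = [2, 3, 5, 7, 11, 13]
residues = [128, 243, 45, 77, 26, 1001]   # made-up example values

def flag_matrix(residues, primes):
    """flags[r][p] is 1 where prime p divides residue r exactly."""
    return [[int(r % p == 0) for p in primes] for r in residues]

def pack_factors(residues, primes):
    """For each residue, keep only the primes that divide it (the packed row)."""
    return [[p for p in primes if r % p == 0] for r in residues]

def divide_out(residue, packed_row):
    """Divide out every power of the packed primes; return (exponents, cofactor)."""
    exponents = {}
    for p in packed_row:
        while residue % p == 0:
            residue //= p
            exponents[p] = exponents.get(p, 0) + 1
    return exponents, residue

flags = flag_matrix(residues, PRIMES)
packed = pack_factors(residues, PRIMES)
factored = [divide_out(r, row) for r, row in zip(residues, packed)]
widest_row = max(len(row) for row in packed)

print(flags[0], packed, widest_row)
```

In this example the sparse matrix is 6 columns wide but no packed row needs more than 3, which is the same effect as going from about 4,000 columns to about 15.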
[Table 13: The same residues with their prime factors packed together into 15 columns]
When we were looking for a way to do this,
we found that we could use the shift register
(see Figure 9). The shift register was
originally in there to improve the machine's
times for multiplication, division, and
floating point operations. However, it turns
out that the shift register can be used for
several other purposes. The shifting is
maskable, so I can shift some shift registers
and not shift others. So in effect, it turns
out to be a locally addressable memory
because I can turn the shifting of it on and
off. Unfortunately, it only has 30 bits in it;
I wish it were larger. So there is a small
quantity of locally addressable memory in
the MPP, and we have found out how to
use it to help some of these problems.
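One way to picture the trick is sketched below. It is not the MPP's actual microcode. It assumes each shift register is circular and is read at position 0, and the names are illustrative. The control unit issues the same shift command to everyone, but each PE's mask bit decides whether its register actually moves, so each PE can stop when the bit it wants has arrived.

```python
# Maskable shifting used as a small locally addressable memory.
# Assumptions: circular registers, read at position 0; names are illustrative.

def masked_shift(registers, mask):
    """One global shift step: only PEs whose mask bit is set rotate by one place."""
    return [reg[1:] + reg[:1] if m else reg for reg, m in zip(registers, mask)]

def local_read(registers, addresses):
    """Each PE reads the bit at its own address from its own register."""
    regs = [list(reg) for reg in registers]
    for step in range(max(addresses, default=0)):
        mask = [step < address for address in addresses]   # keep shifting until aligned
        regs = masked_shift(regs, mask)
    return [reg[0] for reg in regs]

registers = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

own_bits = local_read(registers, [0, 1, 2, 3])    # PE i reads its bit i
same_bits = local_read(registers, [1, 1, 1, 1])   # every PE reads bit 1

print(own_bits, same_bits)
```

The cost is one global step per shift position, so the time grows with the largest address requested. That is workable for a 30-bit register, but it is one reason a real locally addressable RAM would be preferable.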
To improve the MPP, I would probably
replace the shift register with a RAM. You
can simulate a shift register within a RAM
and then you can do other things with the
RAM. You would really like to make the
whole memory locally addressable, but then
you have some problems. If this is off on a
separate chip, then you have to worry about
how you transmit the address from the PE
into the memory and back again. It either
takes a lot of pins or some other circuitry.
The compromise is to put a RAM on the
chip and put another RAM off the chip. The
on-chip RAM is implemented with VLSI
techniques. Memory manufacturers make
all their memory chips with their own rules
to get a lot denser memory chips than what
you can build just using the standard VLSI
design rules. If you still want a lot of
RAM, then you still need a standard
memory chip from a manufacturer.
Anyway, you could build your own locally
addressable RAM. The places where you
would like to see this is in Artificial
Intelligence (AI).
I was looking at AI problems on the MPP.
I found that this local addressability could
be used to help out. For example, if several
concepts are stored in each PE, while one
PE is looking at its third concept, another
PE may be looking at its fourth concept.
With global addressing that is hard to do.
With local addressability, it turns out to be a
lot easier.
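The reason this is hard with global addressing can be sketched as follows. The data and names are made up for illustration. With only a global address, the control unit has to sweep through every concept slot and let each PE keep the value only when the slot matches the one it wants. With local addressing each PE fetches its own slot in a single step.

```python
# Each PE holds several concepts and wants a different one.
# Data and names are made up; step counts are for comparison only.

concepts = [
    ["cat", "dog", "bird", "fish"],      # PE 0
    ["red", "green", "blue", "gray"],    # PE 1
    ["one", "two", "three", "four"],     # PE 2
]
wanted = [2, 3, 0]                       # slot each PE wants to look at

def fetch_with_global_addressing(concepts, wanted):
    result = [None] * len(concepts)
    steps = 0
    for address in range(len(concepts[0])):        # control unit sweeps every slot
        plane = [pe[address] for pe in concepts]   # all PEs read the same slot
        for pe, value in enumerate(plane):
            if wanted[pe] == address:              # mask: keep it only where wanted
                result[pe] = value
        steps += 1
    return result, steps

def fetch_with_local_addressing(concepts, wanted):
    return [pe[slot] for pe, slot in zip(concepts, wanted)], 1

global_result, global_steps = fetch_with_global_addressing(concepts, wanted)
local_result, local_steps = fetch_with_local_addressing(concepts, wanted)

print(global_result, global_steps, local_result, local_steps)
```

Both approaches return the same values, but the global sweep takes one step per slot, and during each step most PEs are masked off.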
When I got Danny Hillis' book, the first
thing I looked for was to see how they got
local addressability...I couldn't find it
anywhere. I know there are some
Connection Machine people here, and I
think that local addressability is a problem
and I would like to see a newer machine
have locally addressable RAM.
Conclusion
In conclusion, I talked about four topics
(see Table 14). I/O---we can get transfer
rates up to 640 megabytes per second.
There are devices around that can supply the
data and accept it at this rate. We can
increase the memory capacity up to 128
megabytes in the ARU and over a gigabyte
in the staging memory. For connectivity,
there are several different kinds of
multi-stage networks that we can consider.
And we also should do something about
local or indirect addressing.
Input-output - 640 megabytes/sec
Memory capacity - 128 megabytes - ARU
                  over 1,000 megabytes - staging memory
Connectivity - multi-stage networks
Local (indirect) addressing - locally addressable RAM
Table 14
