Shared versus distributed memory multiprocessors by Jordan, Harry F.
NASA Contractor Report 187501
ICASE Report No. 91-7
ICASE
SHARED VERSUS DISTRIBUTED MEMORY
MULTIPROCESSORS
Harry F. Jordan
Contract No. NAS1-18605
January 1991
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, Virginia 23665-5225
Operated by the Universities Space Research Association
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23665-5225
(NASA-CR-187501) SHARED VERSUS DISTRIBUTED
MEMORY MULTIPROCESSORS Final Report (ICASE)
ZO p CSCL OVB
G31aO
N91-18613
Unclas
0333712
https://ntrs.nasa.gov/search.jsp?R=19910009300 2020-03-19T19:42:08+00:00Z

Shared Versus Distributed Memory Multiprocessors*
Harry F. Jordan
ABSTRACT
The question of whether multiprocessors should have shared or distributed memory
has attracted a great deal of attention. Some researchers argue strongly for building dis-
tributed memory machines, while others argue just as strongly for programming shared
memory multiprocessors. A great deal of research is underway on both types of parallel
systems. This paper puts special emphasis on systems with a very large number of pro-
cessors for computation intensive tasks and considers research and implementation
trends. It appears that the two types of system will likely converge to a common form for
large scale multiprocessors.
*This work was supported in part by the National Aeronautics and Space Administration under NASA contract NAS 1-18605
while the author was in residence at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665, and in part by
the National Science Foundation under Grant NSF-G87-17773.

What Are They?
The generic term parallel processor covers a wide variety of architectures, including
SIMD machines, data flow computers and systolic arrays. The issue of shared versus dis-
tributed memory arises specifically in connection with MIMD computers or multiproces-
sors. These are sometimes referred to simply as "parallel" computers to distinguish them
from vector computers, but we prefer to be precise and call them multiprocessors to
avoid confusion with the generic use of the former word. Some similar sounding but dif-
ferent terms are often used in a confusing way. Multiprocessors are computers capable
of running multiple instruction streams simultaneously to cooperatively execute a single
program. Multiprogramming is the sharing of a computer by many independent jobs.
They interact only through their requests for the same resource. Multiprocessors can be
used to multiprogram single stream (sequential) programs. A process is a dynamic
instance of an instruction stream. It is a combination of code and process state, e.g. pro-
gram counter and status word. Processes are also called tasks, threads, or virtual proces-
sors. The term Multiprocessing can be ambiguous. It is either:
a) Running a program (perhaps sequential) on a multiprocessor or
b) Running a program which consists of several cooperating processes.
The interest here is in the second meaning of multiprocessing. We want to gain high
speed in scientific computation by breaking the computation into pieces which are
independent enough to be performed in parallel using several processes running on
separate hardware units but cooperative enough that they solve a single problem.
There are two basic types of MIMD or multiprocessor architectures, commonly
called shared memory and distributed memory multiprocessors. Figure 1 shows block
diagrams of these two types, which are distinguished by the way in which values
computed by one processor reach another processor. Since architectures may have mixtures
of shared and private memories, we use the term "fragmented" to indicate lack of any
shared memory. Mixing memories private to specific processors with shared memory in
a system may well yield a better architecture, but the issues can be discussed easily with
respect to the two extremes: fully shared memory and fragmented memory.
Figure 1: Shared and distributed memory multiprocessors. [Panels: shared memory
multiprocessor ("dance hall" architecture); distributed memory multiprocessor
("boudoir" architecture).]
A few characteristics are commonly used to distinguish shared and fragmented
memory multiprocessors. Starting with shared memory machines, communication of
data values between processors is by way of memory, supported by hardware in the
memory interface. Interfacing many processors may lead to long and variable memory
latency. Contributing to the latency is the fact that collisions are possible among
references to memory. As in uniprocessor systems with memory module interleaving,
randomization of requests may be used to reduce collisions. Distinguishing characteristics
of fragmented memory rest on the fact that communication is done in software by data
transmission instructions, so that the machine level instruction set has send/receive
instructions as well as read/write. The long and variable latency of the interconnection
network is not associated with the memory and may be masked by software which
assembles and transmits long messages. Collisions of long messages are not easily
managed by randomization, so careful management of communications is used instead.
The key question of how data values produced by one processor reach another to be used
by it as operands is illustrated in Fig. 2.
The organizations of Fig. 1 and the transmission mechanisms of Fig. 2 lead to a
broad brush characterization of the differences in the appearance of the two types of
architecture to a user. A shared memory multiprocessor supports communication of data
entirely by hardware in the memory interface. It requires short and uniform latency for
access to any memory cell. The collisions which are inevitable when multiple processors
access memory can be reduced by randomizing the references, say by memory module
interleaving. A fragmented memory switching network involves software in data
communication by way of explicit send and receive instructions. Data items are packed into
large messages to mask long and variable latency. Since messages are long, communications
scheduling instead of randomization is used to reduce collisions. To move an intermediate
datum from its producer to its consumer a fragmented memory machine ideally
sends it to the consumer as soon as it is produced, while a shared memory machine stores
it in memory to be picked up by the consumer when it is needed.
Figure 2: Communication of data in multiprocessors. [Panels: a) Shared memory
communication: one processor issues write(loc. A) and another issues read(loc. A)
through the switch to shared memory. b) Fragmented memory communication: one CPU
issues send(proc. Y) and another issues receive(proc. X) through the communications
switch.]
It can be seen from Fig. 1 that the switching network which communicates data
among processors occupies two different positions with respect to the classical, von
Neumann, single processor architecture. In shared memory, it occupies a position analogous
to that of the memory bus in a classical architecture. In the fragmented memory case, it
is independent of the processor to memory connection and more analogous to an I/O
interface. The use of send and receive instructions in the fragmented memory case also
contributes to the similarity to an I/O interface. This memory bus versus I/O channel
nature of the position of the switching network underlies the naive characterization of the
differences between the two types of network. A processor to memory interconnection
network involves one word transfers with reliable transmission. The address (name) of
the datum controls a circuit switched connection with uniform access time to any loca-
tion. Since a read has no knowledge of a previous write, explicit synchronization is
needed to control data sharing. In contrast, a processor to processor interconnection net-
work supports large block transfers and error control protocols. Message switching
routes the information through the network on the basis of the receiving processor's
name. Delivery time varies with the source and destination pair, and the existence of a
message at the receiver provides an implicit form of synchronization.
From the user's perspective, there are two distinct naive programming models for
the two multiprocessor architectures. A fragmented memory machine requires mapping
data structures across processors and the communication of intermediate results using
send and receive. The data mapping must be available in a form which allows each pro-
cessor to determine the destinations for intermediate results which it produces. Large
message overhead encourages the user to gather many data items for the same destination
into long messages before transmission. If many processors transmit simultaneously, the
source/destination pairs should be disjoint and not cause congestion on specific paths in
the network. The user of a shared memory machine sees a shared address space and
explicit synchronization instructions to maintain consistency of shared data. Synchroni-
zation can be based on program control structures or associated with the data whose shar-
ing is being synchronized. There is no reason to aggregate intermediate results unless
synchronization overhead is unusually large. Large synchronization overhead leads to a
programming style which uses one synchronization to satisfy many write before read
dependencies at once. Better performance can result from avoiding memory "hot spots"
by randomizing references so that no specific memory module is referenced simultane-
ously by many processors.
Why it Isn't That Simple
The naive views of the hardware characteristics and programming styles for shared
and fragmented memory multiprocessors just presented are oversimplified for several
reasons. First, as already mentioned, shared and private memories can be mixed in a sin-
gle architecture, as shown in Fig. 3. This corresponds to real aspects of multiprocessor
programs, where some data is conceptually private to the processor doing an individual
piece of work. The program, while normally shared by processors, is read only for each
and should be placed in a private memory, if only for caching purposes. The stack
generated by most compilers normally contains only private data and need not be in shared
memory. In addition, analysis done by many parallel compilers identifies some shared
data as read only and thus cachable in private memory. Some multiprocessors share
memories among some, but not all, processors. Examples are the PAX[1] and
DIRMU[2] computers. These machines move intermediate data by having its producer
place it in the correct memory and its consumer retrieve it from there. The transmission
may be assisted by other processors if producer and consumer do not share a memory.
Not only may a multiprocessor mix shared and private memories, but the same
memory structure may have different appearances when viewed at different system lev-
els. An important early multiprocessor was Cm*[3], built at Carnegie Mellon University.
An abbreviated block diagram of the architecture is shown in Fig. 4. Processors were
attached by way of a local bus to memories and possibly I/O devices to form computer
modules. Several computer modules were linked into a cluster by a cluster bus.
Processors could access the memory of other processors using the cluster bus. Processors in
different clusters communicated through interconnected mapping controllers, called
K.maps. The name K.map and some of the behavior of Cm* are easier to understand in
light of the fact that the PDP-11 had a very small physical address, so that address
mapping was essential to accessing any large physical memory, shared or not.
Figure 3: Shared plus private memory architecture. [Notation: P - processor,
M - memory, S - switch (high concurrency), K - controller, T - transducer (I/O device).]
Figure 4: Architecture of the Cm* multiprocessor. [Diagram: PDP-11 processors (P),
memories (M), controllers (K) and transducers (T) attached through switches (S) to a
cluster bus, which is linked through a K.map to other K.maps.]
Not only does Cm* illustrate a mixture of shared and fragmented memory ideas, but
there are three answers to the question of whether Cm* is a shared or fragmented
memory multiprocessor. At the microcode level in the K.map, there are explicit send and
receive instructions and message passing software, thus making the Cm* appear to be a
fragmented memory machine. At the PDP-11 instruction set level, the machine has
shared memory. There were no send and receive instructions, and any memory cell
could be accessed by any processor. The page containing the memory address had to be
mapped into the processor's address space, but as mentioned, this was a standard
mechanism for the PDP-11. A third answer to the question appeared at the level of
programs running under an operating system. Two operating systems were built for Cm*.
The processes which these operating systems supported were not allowed to share any
memory. They communicated through operating system calls to pass messages between
processes. Thus at this level Cm* became a fragmented memory machine once more.
Taking the attitude that a machine architecture is characterized by its native instruc-
tion set, we should call Cm* a shared memory machine. A litmus test for a fragmented
memory machine could be the existence of distinct send and receive instructions for data
sharing in the processor instruction set. The Cm* is an example of shared memory
machines with non-uniform memory access time, sometimes called NUMA machines. If
access to a processor's local memory took one time unit, then access via the cluster bus
required about three units and access to memory in another cluster took about 20 units.
Writing programs under either operating system followed the programming paradigm for
fragmented memory multiprocessors, with explicit send and receive of shared data, but
performance concerns favored large granularity cooperation less strongly than in a truly
fragmented memory machine.
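The effect of the non-uniform access times quoted above can be made concrete with a
small calculation. The following is only a sketch: the relative costs (1, 3 and 20 units)
are those given in the text, but the reference mixes are hypothetical values chosen for
illustration, and the function name mean_access_time is not from any Cm* software.

```python
# Relative access costs quoted in the text for Cm* (local access = 1 time unit).
CM_STAR_COST = {"local": 1.0, "cluster": 3.0, "remote": 20.0}


def mean_access_time(mix, cost=CM_STAR_COST):
    """Average memory access time for a reference mix.

    `mix` maps each level ("local", "cluster", "remote") to the fraction of
    references served at that level; the fractions must sum to 1.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("reference fractions must sum to 1")
    return sum(fraction * cost[level] for level, fraction in mix.items())


if __name__ == "__main__":
    # Hypothetical reference mixes, not measurements.
    mostly_local = {"local": 0.90, "cluster": 0.08, "remote": 0.02}
    scattered = {"local": 0.50, "cluster": 0.30, "remote": 0.20}
    print(mean_access_time(mostly_local))
    print(mean_access_time(scattered))
```

Under these assumed mixes, even a small fraction of references to another cluster
dominates the average, which is one way to see why placement of data matters on a
NUMA machine.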
A more recent NUMA shared memory multiprocessor is the BBN Butterfly[4].
References to non-local memory take about three times as long as local references. The
Butterfly processor to memory interconnection network also contradicts the naive charac-
terization of shared memory switches. The network connecting N processors to N
memories is a multistage network with log_2 N stages, and thus (N/2) log_2 N individual
links. It thus has a potentially high concurrency, although collisions are possible when
two memory references require the same link. Read and write data are sent through the
network as messages with a self routing header which establishes a circuit over which the
data bits follow. Messages are pipelined a few bits at a time, and long data packets of
many words can use the circuit, once established. Thus, although single word transfers
are the norm, higher bandwidths can be achieved by packing data into a multiword
transmission. Messages attempting to reference a memory which is in use, or colliding
with others in the switch, fail and are retried by the processor.
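As a rough check on the network sizes quoted above, the stage count and the
(N/2) log_2 N figure can be tabulated for a few values of N. This sketch simply evaluates
the expressions given in the text for N a power of two; the function name
multistage_network is hypothetical.

```python
def multistage_network(n):
    """Return (stages, count) for a multistage network joining n processors
    to n memories, using the expressions in the text: log2(n) stages and
    (n/2) * log2(n) for the second figure. n must be a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1          # exact log2 for a power of two
    return stages, (n // 2) * stages


if __name__ == "__main__":
    for n in (8, 64, 256):
        print(n, multistage_network(n))
```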
Finally, the naive view of the difference between implicit synchronization in
fragmented memory and the need for explicit synchronization with shared memory should be
challenged. A shared memory synchronization based on data rather than control
structures is that of asynchronous variables. Asynchronous variables have a state as well as a
value. The state has two values, usually called full and empty, which control access to
the variable by two operations, produce and consume. Produce waits for the state to be
empty, writes the variable with a new value, and sets the state to full. Consume waits for
the state to be full, reads the value, and sets the state to empty. Both are atomic opera-
tions, or in general obey the serialization principle. Void and copy operations are often
supplied to initialize the state to empty and to wait for full, read and leave full,
respectively. The HEP[5] and Cedar[6] computers supported these operations on memory cells
in hardware.
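The produce, consume, copy and void operations just described can be sketched in
software. The version below is an illustration only, assuming Python threads and a
condition variable in place of the hardware full/empty bit of the HEP or Cedar; the
class name AsyncVariable and the helper transfer are hypothetical.

```python
import threading


class AsyncVariable:
    """Software sketch of a full/empty asynchronous variable."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False      # state: False = empty, True = full
        self._value = None

    def produce(self, value):
        """Wait for empty, write the value, set the state to full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def consume(self):
        """Wait for full, read the value, set the state to empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

    def copy(self):
        """Wait for full, read the value, leave the state full."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            return self._value

    def void(self):
        """Set the state to empty without reading."""
        with self._cond:
            self._full = False
            self._cond.notify_all()


def transfer(items):
    """Pass items one word at a time from a producer thread to the caller."""
    cell = AsyncVariable()
    producer = threading.Thread(
        target=lambda: [cell.produce(x) for x in items])
    producer.start()
    received = [cell.consume() for _ in items]
    producer.join()
    return received
```

Each produce blocks until the previous value has been consumed, so the cell acts as a
one word buffer and the values arrive in order without any other synchronization.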
When data is put in memory by one processor using produce and read by another
using consume, the transaction behaves like a one word message from producer to
consumer, with minor differences. The memory cell serves as a one word buffer, and may be
occupied when produce is attempted. The producer need not name the consumer;
instead, both name a common item as when send and receive are linked to a common
communications channel name. Another difference is that one produce and multiple
copys suffice to deliver the same datum to multiple receivers.
Abstraction of Characteristics
The essence of the problem to be addressed by the switching network in both shared
and fragmented memory multiprocessors is the communication of data from a processor
producing it to one which will use it. This process can slow parallel computation when
either the producer is delayed in transmitting or when the consumer is delayed in receiv-
ing. This process can be abstracted in terms of four characteristics: initiation of transmis-
sion to the data's destination, synchronization of production and use of the data, binding
of the data's source to its destination, and how transmission latency is dealt with. Table
1 summarizes these characteristics and tabulates them for the traditional views of shared
and fragmented memory multiprocessors.

Characteristics    Fragmented Memory                Shared Memory
Initiation         Producer                         Consumer
Synchronization    Implicit by message existence    Explicit
Binding            Processor name                   Data name
Latency            Masked by early send             Consumer waits

Table 1: Data Sharing in Multiprocessors.
The initiation of data delivery to its consumer is a key characteristic and influences
the others. Producer initiated delivery characterizes the programming model of frag-
mented memory multiprocessors. It implies that the producer knows the identity of the
consumer, so that binding by processor name can be used, and provides the possibility of
implicit synchronization when the consumer is informed of the arrival of data. If a pro-
ducer in a shared memory multiprocessor were forced to write data into an asynchronous
variable in a section of memory uniquely associated with the consumer, the programming
model would be much the same as for fragmented memory. Consumer initiated access to
data assumes a binding where the identity of the data allows a determination of where it
resides. Since the consumer operation is decoupled from the data's writing by its pro-
ducer, explicit synchronization is needed to guarantee validity. One can imagine a frag-
mented memory system in which part of a data item's address specifies its producer and a
sharing protocol in which the consumer sends a request message to the owner (producer)
of a required operand. An interrupt could cause the owner to satisfy the consumer's
request, yielding a consumer initiated data transmission. Such a fragmented memory
system would be programmed like a shared memory machine. Binding is by data name,
and the consumer has no implicit way of knowing the data it requests has been written
yet, so explicit synchronization is required.
Too many explicit synchronization mechanisms are possible to attempt a complete
treatment, and sufficient characterization for our purposes has already been given. Since
message delivery is less often thought of in terms of synchronization, Table 2
summarizes the types of synchronization associated with message delivery. Different requirements
are placed on the operating or run-time system and different precedence constraints are
imposed by the possible combinations of blocking and non-blocking send and receive
operations.
Types of binding between producer and consumer in fragmented memory systems
include: source/destination pair, channel, and destination/type. In the case of
source/destination, the send operation names the destination and receive names the
source. A message can be broadcast, or sent to multiple receivers, but not received from
multiple sources. Source thus designates a single processor while destination might
specify one or more. Message delivery can also be through a "channel" or mailbox. In
this case send and receive are connected because both specify the same channel. A
channel holds a sequence of messages, limited by the channel capacity. To facilitate a
receiver handling messages from several sources, a sender can specify a "type" for the
message and the receiver ask for the next message of that type. The source is then not
explicitly specified by the receiver but may be supplied to it as part of the message.

Message Synchronization    System Requirements         Precedence Constraints
Send: nonblocking          Message buffering           None, unless message is
Receive: nonblocking       Fail return from receive    received successfully
Send: nonblocking          Message buffering           Actions before send precede
Receive: blocking          Termination detection       those after receive
Send: blocking             Termination detection       Actions before receive precede
Receive: nonblocking       Fail return from receive    those after send
Send: blocking             Termination detection       Actions before rendezvous
Receive: blocking          Termination detection       precede ones after it
                                                       in both processes.

Table 2: Summary of the types of message synchronization.
Binding in shared memory is normally by data location, but note that the Linda[7] shared
tuple memory uses content addressability, which is somewhat like the "type" binding just
mentioned.
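The destination/type binding and the nonblocking combination in the first row of
Table 2 can be illustrated together. The following is a minimal single-process sketch,
not a model of any particular message passing system: the class Mailbox, its method
names, and the message types used in the example are all hypothetical.

```python
from collections import deque


class Mailbox:
    """Per-receiver mailbox with "type" binding (illustrative only)."""

    def __init__(self):
        self._messages = deque()

    def send(self, source, msg_type, payload):
        """Nonblocking send: the message is buffered until it is received."""
        self._messages.append((source, msg_type, payload))

    def receive(self, msg_type):
        """Nonblocking receive of the oldest buffered message of msg_type.

        Returns (source, payload), or None as the "fail return" when no
        message of that type is buffered.
        """
        for index, (source, kind, payload) in enumerate(self._messages):
            if kind == msg_type:
                del self._messages[index]
                return source, payload
        return None


if __name__ == "__main__":
    box = Mailbox()
    box.send("P1", "boundary", [1.0, 2.0])
    box.send("P2", "residual", 0.5)
    print(box.receive("residual"))   # source is supplied with the message
    print(box.receive("residual"))   # fail return: nothing of that type left
```

The receiver names only the type; the source reaches it as part of the message, and
the need for message buffering and a fail return from receive is visible directly.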
The problem of latency in sharing data and how it is dealt with is the most important
issue in the performance of multiprocessors. At the lowest level it is tied up with the
latency and concurrency of the switch. Two slightly different concepts should be
distinguished. If T_s is the time at which a send is issued in a message passing system and
T_r is the time at which the corresponding receive returns data, then the latency is
T_L = T_r - T_s. The transmission time for messages often has an initial startup overhead
and a time per unit of information in the message, of the form t_i + k t_u, where k is the
number of units transmitted. The startup time t_i is less than T_L, but is otherwise
unrelated. In particular, if T_L is large, several messages can be sent before the first one is
received. The granularity of data sharing is determined by the relationship of t_i to t_u. If
t_i >> t_u good performance dictates k >> 1, making the granularity coarse. If t_i ≈ t_u the
fine granularity case of k = 1 suffers little performance degradation. Read and write in a
shared memory switch must at least have small t_i so that data transmissions with small k
perform well. A fine granularity switch with small startup t_i may still have a large
latency T_L, and this is the concern of the fourth characteristic in Table 1.
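The relationship between startup time, per-unit time and granularity can be checked
numerically. This sketch evaluates the transmission time t_i + k t_u from the text and
the fraction of that time spent actually moving data; the function names and the
sample values of t_i and t_u are illustrative assumptions.

```python
def transmission_time(k, t_i, t_u):
    """Time to send one message of k units: startup plus per-unit cost."""
    return t_i + k * t_u


def efficiency(k, t_i, t_u):
    """Fraction of the transmission time spent moving data rather than
    paying the startup overhead."""
    return (k * t_u) / transmission_time(k, t_i, t_u)


if __name__ == "__main__":
    # Coarse-grained case, t_i >> t_u (hypothetical values).
    for k in (1, 10, 100, 1000):
        print("t_i=100, t_u=1, k=%d: %.3f" % (k, efficiency(k, 100.0, 1.0)))
    # Fine-grained case, t_i comparable to t_u.
    print("t_i=1, t_u=1, k=1: %.3f" % efficiency(1, 1.0, 1.0))
```

With t_i much larger than t_u the efficiency is poor until k is large, while with t_i
comparable to t_u single unit messages already use half of the transmission time for
data.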
Latency must grow with the number of processors in a system, if only because its
physical size grows and signal transmission is limited by the speed of light. As the
system size grows, the key question is how the inevitable latency is dealt with. An
architecture in which latency does not slow down individual processors as the number of them
increases is called scalable. Scalability is a function both of how latency grows and how
it is managed. Message latency can be masked by overlapping it with useful
computation. Figure 5 shows a send/receive transaction in a fragmented memory system. In part
a) message latency is successfully overlapped by computation in the consumer whereas
in part b) the consumer does not have enough to do before needing the data in order to
completely mask the latency. In reference to Fig. 5, scalability is a function of how the
program doing the sends and receives is organized. The ratio of available overlapping
computation to message latency decreases as system size grows, both because latency
grows and because computation is more finely divided.
Figure 5: Masking message latency by computation. [Panels: a) Message latency well
masked by computation. b) Poorly masked message latency. Each panel shows producer,
channel and consumer timelines: the producer produces an intermediate result and sends
it to the consumer, and the consumer computes independently of the producer, or waits
for the message, before receiving the result.]
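The two cases of Fig. 5 differ only in whether the consumer has enough independent
work to cover the message latency. The sketch below computes the consumer's idle time
and then applies it under one hypothetical scaling assumption, namely a fixed amount of
work divided evenly over P processors and a latency proportional to log_2 P. Both the
scaling rule and the function names are assumptions made for illustration.

```python
import math


def consumer_wait(latency, independent_work):
    """Idle time at the consumer: zero if the independent work covers the
    latency (Fig. 5 a), otherwise the uncovered remainder (Fig. 5 b)."""
    return max(0.0, latency - independent_work)


def wait_at_scale(total_work, hop_latency, processors):
    """Consumer wait under an assumed scaling: work per processor is
    total_work / P and latency is hop_latency * log2(P)."""
    work_per_processor = total_work / processors
    latency = hop_latency * math.log2(processors)
    return consumer_wait(latency, work_per_processor)


if __name__ == "__main__":
    for p in (4, 64, 1024):
        print(p, wait_at_scale(1024.0, 2.0, p))
```

Under these assumptions the wait is zero for small systems and becomes nonzero as P
grows, since the latency rises while the work available to overlap it shrinks.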
In shared memory multiprocessors the consumer initiation of access to data when
needed eliminates the possibility of arranging the program so that sends occur early
enough to mask latency. Latency can be managed in this case, as in the other, by reduc-
ing it or by masking it with useful computation. Latency reduction in the shared memory
hardware regime is done by caching and latency masking by pipelining or multiprogram-
ming. In the naive view, scalability is a hardware concern in shared memory but more a
function of program structure in fragmented memory, leading to the notion of software
scalability. Assuming infinitely fast transmission, networks with P processors and a
reasonable number of switching nodes usually have latency on the order of log_m P, where m
is the number of input and output ports per switch node. If finite speed of signal
transmission is an issue, latency is proportional to the cube root of P for a system build-
able in three dimensional space and to the square root of P if messages occupy volume.
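The three growth rates mentioned here can be compared directly. The sketch below
counts switch stages for a network of m-port nodes and evaluates the cube root and
square root laws in relative units; constants of proportionality are omitted, and the
function names are hypothetical.

```python
def switch_stages(processors, ports):
    """Smallest number of stages s with ports**s >= processors, i.e. the
    log_m P latency of the text rounded up to a whole number of stages."""
    stages = 0
    reach = 1
    while reach < processors:
        reach *= ports
        stages += 1
    return stages


def wire_limited_latency(processors, messages_occupy_volume=False):
    """Relative latency when signal speed is the limit: cube root of P for
    a system packed in three dimensions, square root of P if messages
    occupy volume. The result is in arbitrary units."""
    exponent = 0.5 if messages_occupy_volume else 1.0 / 3.0
    return processors ** exponent


if __name__ == "__main__":
    for p in (64, 4096, 262144):
        print(p, switch_stages(p, 2), switch_stages(p, 4),
              round(wire_limited_latency(p), 1),
              round(wire_limited_latency(p, True), 1))
```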
Concurrency of the switch also has an influence on latency. It must clearly have a
concurrency much greater than one for any multiprocessor with more than a very few
processors. Using a single bus for this switch is inadmissible in all but the smallest of
systems. For scalability, concurrency should grow linearly with the number of
processors; otherwise the lack of physical network paths will lead to long latencies when
many processors use the switch simultaneously. Even with order P links, collisions
between messages can occur under unfavorable access patterns. The way to control col-
lisions is a function of granularity. In a fine granularity network, randomization which
distributes the small transactions uniformly over the network is usually appropriate.
With large granularity transactions, randomization is less effective, and scheduling of the
transactions may be required.
Thus the abstract differences between shared and fragmented memory multiproces-
sors rest on the four characteristics of Table 1, with the selection of producer or consu-
mer initiation of data delivery having a strong influence on the other three. Consumer
initiation is naively associated with explicit synchronization, data name binding, and
latency reduction. Producer initiation suggests implicit synchronization, processor name
binding, and latency tolerance by executing sends early.
Convergence
The direction of current developments in shared and fragmented memory multipro-
cessors is generally toward convergence. The desire to write programs with a shared
name space for fragmented memory machines is supported both by research on virtual
shared memory using paging techniques and by automatic compiled or preprocessed gen-
eration of sends and receives for remote data references. Multiprogramming the nodes of
a fragmented memory multiprocessor can also increase the amount of computation
available to mask latency. Virtual processors make use of the idea of parallel slackness, or
using some of a problem's inherent parallelism to control latency. In shared memory
multiprocessors, considerable work is being applied to multiprocessor caching, which
distributes shared data among processors to reduce latency. Hardware cache manage-
ment, software cachability analysis, and correct placement and copying in NUMA
machines have been considered. Much attention has been given to fast, packet switched,
multistage interconnection networks for use in the processor to memory interface, and
pipelining techniques have been applied to tolerate the inevitably large latency of such
networks connecting many processors.
Support for a shared name space on fragmented memory multiprocessors takes
several forms. Li[8] has considered using paging techniques to produce a shared
memory address space on a fragmented machine. If the paging is heavily supported by
hardware, convergence is easily seen between this work and the work on multiprocessor
caching exemplified by [9]. Another approach uses program analysis to automatically
generate the sends and receives required to move data from its producer to its consumer.
For regular access patterns, the user can specify data mapping and a language like
DINO[10] can generate message transmissions to satisfy non-local references. When
regular access patterns are generated by loops in automatic parallelization of a sequential
program[11], the more constrained structure allows even more of the mapping and data
movement to be generated automatically by the compiler.
Automating data mapping across distributed memories has a long history and might
be typified by the work of Berman and Snyder[12]. If access patterns are data dependent,
as in computations on machine generated grids, they may still be constant over long
periods. It may then be beneficial to bind addresses and generate data movement using a
preprocessor[13] which acts at run-time, after data affecting addresses is known, but
before the bulk of the computation, which is often iterative, is carried out. Preprocessor
work can thus be amortized over many iterations with the same access pattern.
Convergent work for shared memory has taken place in connection with NUMA architectures.
The BBN Butterfly provides support for placement and copying to reduce the penalty for
long memory references. Software places private data in the local memory of its
processor and randomizes references to structures such as arrays over memory modules to avoid
memory "hot spots"[14].
Finally, convergence in latency hiding techniques is seen between the use of virtual
processors in fragmented memory and pipelining in shared memory multiprocessors. If
we attempt to use consumer initiation in fragmented memory by interrupting the owner
of a datum with a request for transmission, we see a behavior like that of Fig. 6 a). In
order to make use of the long wait resulting from consumer initiation of the delivery, the
processor executing the consumer process can be switched to another process, as shown
in Fig. 6 b). If the process is associated with a different program, we have the standard
technique of masking latency by multiprogramming, which is used in masking disk
latency in virtual memory systems. If the extra process is associated with the same
parallel program, we have a partly time multiplexed form of multiprocessing often
characterized by the name virtual processors. The use of virtual processors to enhance
performance has recently been most frequently discussed in relation to an SIMD
architecture, the Connection Machine[15], where it is important for masking latency arising
from several different sources. If each processor of a fragmented memory multiprocessor
time multiplexes several processes so that message latency in the communication
network is overlapped with useful computation, a time snapshot of message traffic and
processor activity might appear as in Fig. 7.
Figure 6: Consumer initiated transmission in a fragmented memory system. [Panels:
a) Latency results in consumer wait: the consumer sends a request message, waits, and
computes only after the reply message arrives. b) Latency masked by multiprocessing:
the consumer processor runs other processes between the request and the reply.]
An early use of multiprocessing to mask memory latency, as opposed to I/O latency,
was in the peripheral processors of the CDC 6600 [16]. Ten slow peripheral processor
memories were accommodated by time multiplexing ten virtual processors on a single set
of fast processor logic. Process contexts were switched on a minor cycle basis. Later,
the Denelcor HEP used fine grained multiprocessing to mask latency in a shared, pipe-
lined data memory. The concept of pipelined multiprocessing is illustrated in Fig. 8.
Round robin issuing of a set of process states into the unified pipeline is done on a minor
cycle basis. Processes making memory references are queued separately to be returned
to the execution queue when satisfied. Pipeline interlocks are largely unnecessary since
instructions which occupy the pipeline simultaneously come from different processes and
can only depend on each other through explicit synchronization.
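The round robin issuing and separate queueing of memory references described above can be imitated with a small simulation. This is a toy model under stated assumptions, not a description of the HEP hardware: one instruction issues per minor cycle, pipeline depth is ignored, memory latency is a fixed number of cycles, and every `memory_every`-th instruction of a process is a memory reference. All names are hypothetical.

```python
from collections import deque


def simulate(num_processes, cycles, memory_latency, memory_every):
    """Return the number of instructions issued in `cycles` minor cycles.

    A process that makes a memory reference at cycle t leaves the execution
    queue and may issue again at cycle t + memory_latency (assumed fixed).
    """
    ready = deque((pid, 0) for pid in range(num_processes))  # (pid, issued so far)
    waiting = []  # (return_cycle, pid, issued so far), in order of return
    issued = 0
    for cycle in range(cycles):
        # Return satisfied memory references to the execution queue.
        still_waiting = []
        for return_cycle, pid, count in waiting:
            if return_cycle <= cycle:
                ready.append((pid, count))
            else:
                still_waiting.append((return_cycle, pid, count))
        waiting = still_waiting

        if not ready:
            continue  # no process can issue: the cycle is lost to latency

        # Round robin: the process at the head of the queue issues one instruction.
        pid, count = ready.popleft()
        count += 1
        issued += 1
        if count % memory_every == 0:
            waiting.append((cycle + memory_latency, pid, count))
        else:
            ready.append((pid, count))
    return issued
```

In this model, with a memory latency of 4 cycles and every instruction a memory reference, one process issues 2 instructions in 8 cycles, while four processes issue an instruction on every one of the 8 cycles.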
For latency to be masked by satisfying requests at a higher rate than processor-
memory latency would imply, many requests must be in progress simultaneously. This
implies a pipelined switch between processors and memory, and possibly pipelining the
memory also. Pipelining and single word access together imply a low overhead, message
switched network. Variable traffic in the shared network requires a completion report for
each transaction, regardless of whether it is a read or write. Whether the memory
modules themselves are pipelined or not depends on the ratio of the module response
time to the step time of the pipelined switch. If the memory module responds
significantly slower than the switching network can deliver requests, memory mapping
and address decoding are obvious places to use pipelining within the memory itself. Fig-
ure 9, which bears an intentional resemblance to Fig. 7, shows an activity snapshot in a
system built of multiple pipelined multiprocessors which mask the latency of multiple
read and write operations in the processor to memory switch.

[Figure 7: Masking Message Transmission with Multiprogramming. Four processors, each shown with a running process, an active queue, a wait queue, and send and receive elements, connected through switch nodes.]

[Figure 8: One execution unit of a pipelined multiprocessor. Labeled elements: pipelined switch, memory reference queue, execution pipeline, process queue.]
[Figure 9: Pipelined Multiprocessors in a Shared Memory Multiprocessor System. Four processors, each running several processes, above four memory modules.]
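The requirement that many requests be in progress simultaneously, and the test for whether a memory module itself needs pipelining, can be made concrete with a little arithmetic. The sketch below assumes a fixed round trip latency and a fixed interval between issued requests. Both function names are hypothetical, and the comparison in the second function is only the ratio test stated in the text.

```python
import math


def requests_in_flight(round_trip_latency, issue_interval):
    """Requests that must be outstanding at once so that one request can be
    issued (and one completed) every `issue_interval` time units, given a
    round trip through switch and memory of `round_trip_latency`."""
    if round_trip_latency <= 0 or issue_interval <= 0:
        raise ValueError("times must be positive")
    return math.ceil(round_trip_latency / issue_interval)


def memory_needs_pipelining(module_response_time, switch_step_time):
    """True when a module responds more slowly than the switch can deliver
    requests to it, so the module would otherwise limit throughput."""
    if module_response_time <= 0 or switch_step_time <= 0:
        raise ValueError("times must be positive")
    return module_response_time > switch_step_time
```

For example, a 40 cycle round trip with a request issued every 2 cycles requires 20 requests in progress, which is why the switch, and possibly the memory, must be pipelined.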
Convergence can also be seen in switching network research. Packet switched pro-
cessor to memory interconnections such as that in the NYU Ultracomputer [17] bear a
strong resemblance to communication networks used in message passing distributed
memory computers. Previously, the store and forward style of contention resolution was
only seen in communications networks carrying information packets much larger than
one memory word. There is also a strong resemblance between the "cut-through rout-
ing" [18] recently introduced in fragmented memory multiprocessors and the previously
mentioned header switched connections made by messages in the BBN Butterfly shared
memory switch.
Conclusions
The question of what one concludes from all this is really a question of what one is
led to predict for the future of multiprocessors. The predictions can be formulated as the
answers to three questions: What will be the programming model and style for multipro-
cessors? How will the system architecture support this model of computation? What
will be the split between hardware and software in contributing to this system architec-
ture?
The programmer will surely reference a global name space. This feature
corresponds too closely to the way we formulate problems, and too much progress has
been made toward supporting it on widely different multiprocessor architectures, for us
to give it up. It also seems that most synchronization will be data based rather than con-
trol based. Associating the synchronization with the objects whose consistency it is sup-
posed to preserve is more direct and less error prone than associating it with the control
flow of one or more processes. Programs will have more parallelism than the number of
physical processors in the multiprocessor expected to run them, with the extra parallelism
being used to mask latency.
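As an illustration of data based synchronization, the sketch below attaches a full/empty state to a shared cell, so that the synchronization travels with the datum rather than with the control flow of the processes. The class `SyncCell`, its methods and the demo function are hypothetical names chosen for this example. The use of Python threads and a condition variable is an assumption made for illustration, not a mechanism described in the paper.

```python
import threading


class SyncCell:
    """A shared variable that carries its own full/empty state."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def produce(self, value):
        """Wait until the cell is empty, then fill it."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def consume(self):
        """Wait until the cell is full, then empty it and return its value."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def run_demo(items):
    """Pass `items` from a producer to a consumer thread through one cell."""
    cell = SyncCell()
    received = []

    def consumer():
        for _ in items:
            received.append(cell.consume())

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        cell.produce(item)
    worker.join()
    return received
```

Neither the producer nor the consumer contains any explicit ordering logic. The order of their accesses is enforced entirely by the state attached to the cell.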
Multiprocessor architecture will consist of many processors connected to many
memories. A portion of the memory will be globally interconnected by way of a high
concurrency switch. The switch will have a latency which scales as log_m P for moderate
speed systems, with m probably greater than two. For the highest speed systems, the
latency will scale as P^{1/2}. Multiprocessors will use a Harvard architecture, separating the
program memory from data memory to take advantage of its very different access pat-
terns. Data memory private to each processor will be used to store the stack, other pro-
cess private data and copies of read only shared data. Only truly shared data will reside
in the shared memory.
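The two latency scaling laws mentioned above can be compared numerically. The following sketch assumes unit constant factors, which the paper does not give, so only the growth rates are meaningful and not the absolute values. The function names are hypothetical.

```python
import math


def multistage_latency(num_processors, radix):
    """Stages in a switch built of radix-m elements: the smallest s with
    radix**s >= num_processors, which grows as log_m P."""
    if num_processors < 1 or radix < 2:
        raise ValueError("need num_processors >= 1 and radix >= 2")
    stages = 0
    reach = 1
    while reach < num_processors:
        reach *= radix
        stages += 1
    return stages


def square_root_latency(num_processors):
    """Latency growing as P^(1/2), with an assumed unit constant."""
    if num_processors < 1:
        raise ValueError("need num_processors >= 1")
    return math.sqrt(num_processors)
```

For P = 4096 the stage count is 12 with m = 2 and 6 with m = 4, which shows why a radix greater than two is attractive. The square root law gives 64 units at the same size and grows much faster as P increases.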
A combination of software and hardware techniques will be used to mask the
latency inherent in data sharing. Compiler analysis will be the main mechanism for
determining what data is truly shared. It may even generate code to dynamically migrate
data into private memories for a long program section during which it is not shared. The
hardware will time multiplex (pipeline) multiple processes on each processor at a very
fine granularity in order to support latency masking by multiplexed computation. Some
of the multiprocessor cache research may find use in partially supporting the data migra-
tion with hardware, but a knowledge of reference patterns is so important to data sharing
that it is unlikely that the hardware will forego the increasingly effective assistance avail-
able from the compiler.
In short, the hardware, assisted by the compiler, of multiprocessor systems can do
much more than we currently ask of it. Moving software mechanisms into hardware pro-
duces a significant performance gain, and should be done when a mechanism is well
understood, proven effective and of reasonably low complexity. Finally, although
automatic parallelization has been poorly treated in this paper, it is perhaps possible to
say that, in spite of the excellent work done in turning sequential programs into parallel
ones, a user should not take great pains in a new program to artificially sequentialize
naturally parallel operations so that they can be done on a computer.
REFERENCES
[1] T. Hoshino, "An invitation to the world of PAX," IEEE Computer, V. 19, pp. 68-79
(May 1986).
[2] W. Haendler, E. Maehle and K. Wirl, "DIRMU multiprocessor configurations,"
Proc. 1985 Int'nl Conf. on Parallel Processing, pp. 652-656 (Aug. 1985).
[3] E.F. Gehringer, D.P. Siewiorek and Z. Segall, Parallel Processing: The Cm* Experi-
ence, Digital Press, Billerica, MA (1987).
[4] R.H. Thomas, "Behavior of the Butterfly parallel processor in the presence of
memory hot spots," Proc. of the 1986 Int'nl Conf. on Parallel Processing, pp. 46-50
(Aug. 1986).
[5] J.S. Kowalik, Ed., Parallel MIMD Computation: The HEP Supercomputer and its
Applications, MIT Press (1985).
[6] D. Gajski et al., "Cedar," Proc. Compcon, pp. 306-309 (Spring 1989).
[7] S. Ahuja, N. Carriero and D. Gelernter, "Linda and friends," IEEE Computer, V. 19,
pp. 26-34 (1986).
[8] K. Li, Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. Thesis,
Yale Univ., New Haven, CT (Sept. 1986).
[9] J.-L. Baer and W.-H. Wang, "Multilevel cache hierarchies: Organizations, proto-
cols, and performance," J. Parallel and Distributed Computing, V. 6, No. 3, pp.
451-476 (June 1989).
[10] M. Rosing, R.W. Schnabel and R.P. Weaver, "Expressing complex parallel algo-
rithms in DINO," Proc. 4th Conf. on Hypercubes, Concurrent Computers & Appli-
cations, pp. 553-560 (1989).
[11] D. Callahan and K. Kennedy, "Compiling programs for distributed-memory mul-
tiprocessors," J. of Supercomputing, V. 2, pp. 131-169 (1988).
[12] F. Berman and L. Snyder, "On mapping parallel algorithms into parallel architec-
tures," Proc. 1984 Int'nl Conf. on Parallel Processing, pp. 307-309 (1984).
[13] J. Saltz, K. Crowley, R. Mirchandaney and H. Berryman, "Run-time scheduling and
execution of loops on message passing machines," J. Parallel and Distributed Com-
puting, V. 8, pp. 303-312 (1990).
[14] R. Rettberg and R. Thomas, "Contention is no obstacle to shared-memory multipro-
cessing," Communications of the ACM, V. 29, No. 12, pp. 1202-1212 (Dec. 1986).
[15] L.W. Tucker and G.G. Robertson, "Architecture and applications of the Connection
Machine," Computer, V. 21, pp. 26-38 (Aug. 1988).
[16] J.E. Thornton, Design of a Computer: The Control Data 6600, Scott, Foresman and
Co., Glenview, Ill. (1970).
[17] A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph and M. Snir,
"The NYU Ultracomputer--Designing an MIMD shared memory parallel com-
puter," IEEE Trans. on Computers, V. C-32, No. 2, pp. 175-189 (Feb. 1983).
[18] W.J. Dally and C.L. Seitz, "The Torus routing chip," Distributed Computing, V. 1,
pp. 187-196 (1986).
Report Documentation Page
1. Report No.: NASA CR-187501; ICASE Report No. 91-7
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: SHARED VERSUS DISTRIBUTED MEMORY MULTIPROCESSORS
5. Report Date: January 1991
6. Performing Organization Code:
7. Author(s): Harry F. Jordan
8. Performing Organization Report No.: 91-7
9. Performing Organization Name and Address: Institute for Computer Applications in Science
and Engineering, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665-5225
10. Work Unit No.: 505-90-52-01
11. Contract or Grant No.: NAS1-18605
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration,
Langley Research Center, Hampton, VA 23665-5225
13. Type of Report and Period Covered: Contractor Report
14. Sponsoring Agency Code:
15. Supplementary Notes: Langley Technical Monitor: Richard W. Barnwell. Final Report.
To appear in Proc. of European Centre for Medium Range Weather Forecasts workshop on
Use of Parallel Processors in Meteorology, Nov. 26-30, 1990.
17. Key Words (Suggested by Author(s)): multiprocessors, shared memory, distributed memory
18. Distribution Statement: 60 - Computer Operations and Hardware; 61 - Computer Programming
and Software; 62 - Computer Systems. Unclassified - Unlimited
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of pages: 19
22. Price: A03
NASA FORM 1626 OCT 86. NASA-Langley, 1991
