Rediflow Multiprocessing by Keller, Robert M. et al.
2-1-1984
Rediflow Multiprocessing
Robert M. Keller
Harvey Mudd College
Frank C. H. Lin
Jiro Tanaka
University of Tsukuba
Recommended Citation
Keller, R.M., F.C.H. Lin, and J. Tanaka. "Rediflow multiprocessing." Computers for artificial intelligence applications Ed. B. Wah and
G.J. Li. (1986): 329-336. The article first appeared as Keller, R.M., F.C.H. Lin, and J. Tanaka. "Rediflow multiprocessing." Proceedings
of IEEE Compcon (Feb. 1984): 410-417.
Rediflow Multiprocessing
Robert M. Keller
University of Utah
and
Lawrence Livermore National Laboratory
Frank C. H. Lin
University of Utah
Jiro Tanaka
University of Utah
Abstract
We discuss the concepts underlying Rediflow, a
multiprocessing system being designed to support
concurrent programming through a hybrid model of
reduction, dataflow, and von Neumann processes. The
techniques of automatic load-balancing in Rediflow are
described in some detail.
1. Introduction
"Rediflow" is the name we give for a collection of ideas
relating to multiprocessor system design and attendant
software capabilities. The name is an elision of
"reduction" and "dataflow", two models for evaluation of
functional languages on multiple processors. As shall be
seen, our conception also includes disciplined aspects of
the "von Neumann" evaluation model as well.
1.1. Language and Software Issues
The motivation for use of functional languages stems from
the fact that programs expressed in them usually contain a
fair amount of implicit concurrency, yet are determinate, or
speed-independent, in that they are guaranteed to give
the same results no matter how many processors are
involved in their execution, and do so independently of the
physical aspects of communication between those
processors. As such, these languages seem to be ideal
for the programming of multiprocessors when little
concern over the distinctions between them and
uniprocessors is desired. The determinacy criterion is
essential for most applications, while intended exceptions
can be handled with minor extensions to it. For example,
we have successfully programmed distributed database
applications, including those involving concurrent updating.
Functional languages have other conceptual advantages,
but space does not permit a lengthy discussion of them
here. We claim that other types of languages, such as
sequential languages and languages for logic
programming, have an essentially functional character, and
can be appropriately combined or embedded in them to
share the advantages mentioned. One such combination
is discussed herein.
This work has been supported by grants from the IBM Corporation,
National Science Foundation (MCS-B106177), Defense Advanced Research
Projects Agency of the US Department of Defense (contract no.
MDA903-81-~414) and, at LLNL, by U.S. Department of Energy (contract
no. W-7405-ENG-48).
Reprinted from The Proceedings of COMPCON S'84, 1984, pages 410-417.
U.S. Government work not protected by U.S. copyright.
1.2. Hardware Issues
There presently seem to be no conceptual difficulties in
interconnecting conglomerates of processors of arbitrarily-high
peak processing capacities. Unfortunately, it is
another matter to obtain useful work from such
conglomerates. In order for a conglomerate to qualify as
a system, it is necessary to provide linkages between the
conception of problem solving and the hardware. Such
techniques must be expressed in appropriate computer
languages, which are then mapped for execution.
Considerations of device technology being held invariant,
it is the ease in performing this mapping which
determines the relative success of a multiprocessing
system.
The ease of mapping depends on the class of applications,
the languages used, compilers, and the underlying
hardware configuration. Clearly, for a fixed application, a
special purpose machine can be designed which will
out-perform all others on that application. Our goal here is
not to address such machines, but rather to develop
techniques which exploit multiprocessing power for a wide
range of applications. A class of applications can be
characterized by the regularity, size span, and granularity
of its members.
Applications of high regularity contain many very similar
operations which present similar computational demands.
For these, approaches such as vector processors or
cellular arrays may be most appropriate. The static data
flow approach [6] also appears useful for applications of
very high regularity, but less so for others, since it relies
on a rather balanced pipeline approach to achieve
speedup together with high utilization.
The size span characteristic of a set of applications
relates to the extent the problem size is apt to vary over
the lifetime of the system. While a particular array
processor may be ideal for problems which can be
contained in one array load, there may be extreme
difficulties in "folding" larger problems to match the
processor configuration. Even if such folding can be
accomplished, it may result in significant unused
processor cycles if not done with finesse.
The third differentiating characteristic mentioned is
granularity. This term refers to the indecomposable units
of work distributed to processors. Fine-grain operations
would be on the level of bit operations, while slightly
larger grains would be arithmetic operations. Large grains
would be the level of processes or entire jobs. Rediflow is
aimed at applications appropriate for medium (and larger)
granularity, in which we intend to include
irregularly-structured problems such as are found in, but not
restricted to, the field of artificial intelligence. Other
applications of medium granularity are certain adaptive
numerical calculations, certain types of signal processing,
and combinations of several application areas which
interact in unpredictable ways.
2. Granularity considerations
Two areas of tradeoff which exist when considering
granularity are communication overhead, and flexibility in
load balancing. We first discuss communication. One
reason why systems do not usually operate with peak
speedup is that there must be some data communication
between granules. This form of communication is minimal
in the equivalent purely sequential computation.
Therefore, attention must be given to reducing it in
concurrent execution. For very small grains, the delay due
to communication may exceed the delay of the operations
themselves. For this reason, small granularity is not
exploitable unless the regularity is so high that necessary
communication paths are short and static, or unless there
is little communication between grains. Widely distributing
many small grains increases the likelihood that overhead
due to communication will be large. Our conscious
attempt at clustering small operations is one issue on
which we seem to differ with other dataflow-related
approaches (e.g. [28]). In summary, favoring large grains
minimizes delay due to communication, but does so at the
expense of loss of speedup due to concurrency.
As mentioned, another factor influencing the choice of
granularity is load balancing, by which we mean the
distribution of grains to the processing units. The ideal
situation is a single initial expenditure involving sending
equal-size grains to all processing units. However, it will
seldom be possible to make such determinations a priori.
Instead, many applications will present work loads which
are data dependent, and thus not susceptible to static
analysis. To fully exploit the available multiprocessing
resources, thus attaining maximum speedup, we need to
have the ability to dynamically distribute load. Here we
must pay attention to the tradeoff which favors small
grains for the ability to balance more evenly, but which
favors larger ones to minimize the total effort in actual
distribution.
An area of concern often mentioned in relation to
granularity is that of context switching, i.e. saving a
processor's registers when it switches its attention from
one unit of work to another, before the former unit is
complete. This is therefore a technique for effectively
reducing the grain-size, particularly if there is need to vary
priority among large grains which may become temporarily
inactive due to data dependencies (e.g. a process waiting
for an i/o request to complete). As such, it seems fair to
lump this overhead with that of load balancing.
3. System organization issues
In addition to intended application granularity,
multiprocessing systems can be classified according to
processor-memory structure. At one extreme, we have
"dancehall" configurations, wherein one can imagine the
system as having processors lined up along one side of a
large dancehall, and memories along the other, with a
network of switches in between. At the other extreme are
"boudoir" configurations, in which each processor is
closely paired with a memory, and a network of switches
is used to communicate between such pairs.
Dancehall configurations appear to provide a uniform time
access of any processor to any memory. However, this
uniformity may disappear if there is significant contention
at individual switches. Unfortunately, this delay also
becomes uniformly longer with increasing numbers of
processors and memories. It is possible to introduce
caches which are coupled closely with processors and
which retain local information for faster access, however
this introduces the difficult problem of "coherence" (cf.
[7, 24]): when one processor wishes to update
information which has been cached by another, the latter
must be invalidated, which entails additional
communication overhead. Any machinery introduced to
overcome this problem has a diluting effect on the useful
capacity of the system. Boudoir configurations avoid this
problem, since each processor has exclusive control over
its own memory. This control also obviates introduction
of special instructions for multiprocessor memory access,
such as test-and-set and its derivatives [10].
4. Locality
The connection of a single processor with its memory has
been pejoratively called the "von Neumann bottleneck" [1].
However, we are convinced that it is a powerful device, to
be exploited as much as possible. A large number of such
"bottlenecks" operating concurrently gives a very high
aggregate bandwidth, much higher than a dancehall
configuration with the same number of processors and
memories, and with less attendant latency of memory
accesses. Of course, these processor/memory pairs do
not usually operate in isolation; however, if the
communication between the components of the pair occurs
much more often than communication between pairs, in
which case we say there is a high locality, then the
boudoir configuration will be superior. It is conjectured
that applications of medium grain and larger usually do
possess sufficient locality to make the boudoir approach
attractive.
When using large numbers (hundreds, to tens of
thousands) of processors, it is not attractive to employ a
centralized task queue from which processors seek work.
One reason for this is that such a queue creates a
bottleneck, and is contrary to reliability considerations. A
second, more subtle, reason is that such a queue tends to
destroy locality, in that it homogenizes the distribution of
data. As an alternative, we propose in Section 8 a method
in which not only is the work distributed to the
processors, but in which the method itself is also
distributed.
5. Evaluation models
Several evaluation models have been suggested as the
basis for multiprocessor execution. The most conventional
of these entails extending the sequential von Neumann
execution model to "processes" which run concurrently,
but with various forms of communication between them.
This method is a large-grain one, and has been most
successful when processes are preassigned to physical
processor-memory pairs [9]. Related, but much
finer-grained, are "dataflow" approaches, in which operations
such as arithmetic are distributed to multiple function
units and operands streamed through logical locations
which feed such units. These have a potentially very high
degree of concurrency, but care must be taken, lest
potential speedup be absorbed by communication
overhead.
Another type of model is called "reduction", in which both
the program and data are treated as an integrated, but
distributed data structure. The spreading of this structure
over the available processors permits its concurrent
transmutation at many sites. A fine-grain string-reduction
multiprocessor has been described by Mago [23]. Another
string-reduction multiprocessor, of medium granularity, is
presented in [20]. A medium-grain multiprocessor based
on graph reduction is described in [14]. The approach of
Rediflow is an extension of the latter.
In the reduction model of evaluation, no resident registers
are employed, so cost of context switching is kept to a
minimum, enabling rapid multiplexing of existing processor
load in an effort to generate more load for concurrent
execution. On the other hand, when the system is
sufficiently loaded, such multiplexing should be abandoned
in favor of more conventional sequential execution. One
means of achieving this effect will be discussed later.
5.1. Evaluation by Graph Reduction
Our particular reduction evaluator can be derived from the
lambda-calculus as a theoretical basis [4]. If one begins
with a simple lambda calculus evaluator operating on
string substitution, and introduces optimizations such as
the use of pointers to sub-expressions rather than
manipulating sub-expressions themselves, one is led to a
graphical, rather than string, representation. Attempts to
make efficient the copying (which arises out of function
applications, or equivalently "beta-reductions") inherent in
this graph representation lead to the use of a linearized
segment representation, in which the operator nodes of
the graph correspond to words in the segment, and lists
of addresses relative to the beginning of the segment
represent arcs from the corresponding nodes. Values of
"free" variables are imported in vectors, rather than
employing an "association list" which must be searched
repeatedly. More details on such representations may be
found elsewhere [14, 5, 15]. It is worth noting that
so-called combinator implementations [27] are also a form of
fine-grained graph reduction.
A computation in this model begins as a single graph, with
the result of one node "demanded". The demand then
propagates to other nodes, some of which are primitive
operators and others of which are defined by graphs of
their own. The latter are expanded by virtually replacing
the nodes with the defining graphs. A scheme similar to
one described in [14] is employed for performing this
virtual replacement through global address linkages.
As an example, suppose there is a tree-structured
database distributed in the memory layer. This database
may have been generated by some prior program, or
explicitly loaded. Suppose further that we have a number
of functions f1, f2, f3, ..., each a "specialist" in performing a
certain kind of search on the database. For example, one
function might produce a certain "view" of the database, a
sub-tree of nodes with a pre-specified property. A
second might compute an aggregate function on the
database, such as the number of positive nodes. A third
might produce a transformed copy of the database, which
is structurally the same, but having node values defined
according to some mapping on individual nodes. All these
functions could be performed concurrently on the
database, each potentially recursively splitting into other
function instances, perhaps creating its own data, which is
made available for higher-level instances of the functions
or for output. It is also possible for one function to be
using another's output while the latter is being computed,
rather than after it is computed. Such phenomena have all
been demonstrated in Rediflow.
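The three "specialist" functions of this example can be sketched in Python as a rough illustration. This is not FEL and not the Rediflow implementation: the names Node, view, count_positive and transform are hypothetical, the database is held in one local memory, and the recursive calls run sequentially, whereas under reduction each recursive call could become a separately schedulable function instance.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One node of a tree-structured database (hypothetical layout)."""
    value: int
    children: List["Node"] = field(default_factory=list)


def view(tree: Node, keep: Callable[[int], bool]) -> Optional[Node]:
    """Produce a 'view': the sub-tree of nodes whose value satisfies keep.

    Assumption: a node failing the property prunes everything below it.
    """
    if not keep(tree.value):
        return None
    kept = [v for v in (view(c, keep) for c in tree.children) if v is not None]
    return Node(tree.value, kept)


def count_positive(tree: Node) -> int:
    """Aggregate function: the number of nodes with a positive value."""
    own = 1 if tree.value > 0 else 0
    return own + sum(count_positive(c) for c in tree.children)


def transform(tree: Node, mapping: Callable[[int], int]) -> Node:
    """Structurally identical copy with each node value mapped."""
    return Node(mapping(tree.value), [transform(c, mapping) for c in tree.children])


db = Node(3, [Node(-1, [Node(4)]), Node(2, [Node(0), Node(5)])])
positives = count_positive(db)
doubled = transform(db, lambda v: 2 * v)
positive_view = view(db, lambda v: v > 0)
```

None of the three functions modifies db, so they could run at the same time on the same database without explicit synchronization, which is the property the example relies on.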
An advantage inherent in the reduction model is that all
synchronization for the above activities is implicit in the
underlying functional language implementation. This
removes a considerable burden from the programmer.
Whenever "strict" functions are involved (functions which
require all of their arguments), the spawning of the
necessary activities takes place automatically. This is a
strong contrast to process-oriented models, in which
there are three separate endeavors: setting up processes,
synchronizing them, and using their values.
6. Integration of von Neumann Processes
Despite the advantages of the reduction model mentioned
above, there remain aspects of applications which cannot
exploit its inherent concurrency, synchronization, etc. It is
not uncommon to find segments of applications which
have a high peak concurrency, but have many internally
sequential embedded segments. Such phenomena have
been known since the earliest discussions of concurrent
computation.
The effect of applying machinery powerful enough for
concurrent computation to inherently sequential segments
is dilution of overall speedup. For example, a major
difficulty with the reduction model is its memory
intensiveness. It imitates an elegant mathematical model
of functional languages, in which data values are never
modified in place; they are only created, and destroyed (by
storage reclamation). To do this for every conceivable
operation means that much time is spent recycling
storage. Although a certain amount of this can be done
concurrently with other processing, the overhead seems to
remain significant. A desirable goal is therefore to
combine the load-spreading potential of reduction with
other methods which are not so storage intensive. The
approach taken in Rediflow entails what we call "von
Neumann processes". These are encapsulated sequential
processes which communicate with their environment in
special ways. Externally, they appear as a form of
"dataflow" functions, an observation made by Kahn [13].
Internally, von Neumann processes are ordinary sequential
programs; operations appearing to be file input and output
("get" and "put") are used to communicate internal data
values to and from the environment in the form of
"tokens" moving on channels.
A key difference between our implementation of von
Neumann processes and the suggestion of Kahn is that
our implementation does not automatically supply
unbounded buffers as channels. Instead, infinite buffers
which can be attached to channels are naturally
implemented in the reduction portion of the model, the
operations of which are based on data structures rather
than token passing. The interface from a von Neumann
process to a reduction-implemented function solidifies a
stream of token values into a stream data structure, while
the interface from a reduction-function to a von Neumann
process does the opposite. These functions are quite
similar to ones which are used for external stream i/o in
implementations of the reduction model [19].
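A minimal Python sketch of the two interface functions follows, under several assumptions. Channel, Stream, solidify, liquefy and to_list are hypothetical names, the channel is a single-threaded stand-in in which more never blocks, and the stream is modeled as a lazily built, memoized linked structure. The memoized tail is what lets the stream act as a buffer: each token is consumed from the channel once, but the resulting data structure can be traversed repeatedly.

```python
from collections import deque
from typing import Any, Callable, List, Optional


class Channel:
    """Hypothetical token channel: every token is consumed exactly once."""

    def __init__(self, tokens=()):
        self._tokens = deque(tokens)

    def put(self, token: Any) -> None:
        self._tokens.append(token)

    def more(self) -> bool:
        return len(self._tokens) > 0

    def get(self) -> Any:
        return self._tokens.popleft()


class Stream:
    """Stream cell: a head value plus a tail that is computed at most once."""

    def __init__(self, head: Any, tail_thunk: Callable[[], Optional["Stream"]]):
        self.head = head
        self._thunk = tail_thunk
        self._tail: Optional["Stream"] = None
        self._forced = False

    def tail(self) -> Optional["Stream"]:
        if not self._forced:
            self._tail = self._thunk()   # reads the next token, if any
            self._thunk = None
            self._forced = True
        return self._tail


def solidify(channel: Channel) -> Optional[Stream]:
    """Process-to-reduction interface: tokens become a stream data structure."""
    if not channel.more():
        return None
    head = channel.get()
    return Stream(head, lambda: solidify(channel))


def liquefy(stream: Optional[Stream], channel: Channel) -> None:
    """Reduction-to-process interface: a stream is emitted as tokens."""
    while stream is not None:
        channel.put(stream.head)
        stream = stream.tail()


def to_list(stream: Optional[Stream]) -> List[Any]:
    """Helper for inspection: walk a finite stream into a Python list."""
    items = []
    while stream is not None:
        items.append(stream.head)
        stream = stream.tail()
    return items
```

For instance, to_list(solidify(Channel([1, 2, 3]))) walks the stream and drains the channel, after which the same stream can still be walked again.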
Software networks of only one type of function can be
connected together arbitrarily, the interface functions
being used to connect networks of different types. The
integration of the two models is done in such a way that
what would have been merely arcs in distributed data
structures can function as logical communication channels
between processes. As such, the integration combines the
"structure" and "token" models described in [5] into one
unified system. Further details are given in [26].
As an example of the use of von Neumann processes,
consider a function which performs the "APL-reduction" of
a sequence (assumed non-empty) by a binary operator g
(assumed non-associative), i.e. if the sequence is [x1, x2,
..., xn] the result is g[...g[g[x1, x2], x3] ..., xn]. A pure
reduction implementation would likely use an
"accumulating" function such as (expressed in our
language FEL [19])
reduce[g, x] =
{
result red1[head:x, tail:x]
red1[accum, y] =
if y = []
then accum
else red1[g[accum, head:y], tail:y]
}
Such an implementation would create n-1 instances of
red1. This is inefficient, even with a built-in "tail
recursion" optimization, since a von Neumann process
could evaluate the same function by the following
sequential program:
reduce[g, x] =
{*
var y, accum;
accum := head:x;
y := tail:x;
while y <> [] do
begin
accum := g[accum, head:y];
y := tail:y
end;
return accum;
*}
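For readers unfamiliar with FEL, the two formulations can be mirrored in Python as an illustration only. The names reduce_accumulating and reduce_sequential are hypothetical, and Python lists stand in for FEL sequences. The first version creates one red1 call per remaining element and copies the tail each time; the second updates a single accumulator in place.

```python
def reduce_accumulating(g, x):
    """Recursive form: one red1 instance per remaining element."""
    def red1(accum, y):
        if not y:
            return accum
        return red1(g(accum, y[0]), y[1:])
    return red1(x[0], x[1:])


def reduce_sequential(g, x):
    """Sequential form: a single accumulator updated in a loop."""
    accum = x[0]
    for item in x[1:]:
        accum = g(accum, item)
    return accum


def subtract(a, b):
    """A non-associative operator, so the left-to-right order matters."""
    return a - b


left_to_right = reduce_sequential(subtract, [10, 1, 2, 3])   # ((10-1)-2)-3
```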
The above example does not demonstrate the use of
channels. If the sequence were tokens from a channel,
rather than components of a data structure, the
corresponding von Neumann process might be
reduce[g, x] =
{*
var y, accum;
accum := get:x;
while more:x do
accum := g[accum, get:x];
return accum
*}
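The channel-based process can be sketched the same way in Python. The Channel class below is a hypothetical single-threaded stand-in: in a real system get would wait for a token to arrive and more would wait until it is known whether the channel has closed, neither of which is modeled here.

```python
from collections import deque


class Channel:
    """Hypothetical input channel holding a finite sequence of tokens."""

    def __init__(self, tokens):
        self._tokens = deque(tokens)

    def more(self):
        """True while at least one token remains to be read."""
        return len(self._tokens) > 0

    def get(self):
        """Remove and return the next token."""
        return self._tokens.popleft()


def reduce_channel(g, x):
    """Fold the tokens arriving on channel x, left to right, with g."""
    accum = x.get()
    while x.more():
        accum = g(accum, x.get())
    return accum


source = Channel([10, 1, 2, 3])
total = reduce_channel(lambda a, b: a - b, source)
```

Unlike the data-structure version, the tokens are gone once read: the channel is empty after the call.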
Evaluation of the effectiveness of von Neumann processes
is demonstrated in [26]. It should be noted, however, that
they would not be superior if the operator g were
associative, and data structures used for the sequence
permitted easy concurrent decomposition (cf. [16]). In this
case, a divide and conquer approach could be used, and
for such, the reduction model seems well-suited.
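A divide and conquer reduction of this kind might look as follows in Python. This is a sketch under assumptions: reduce_dc is a hypothetical name, a Python list stands in for a sequence representation that splits cheaply, and the two recursive calls are written sequentially although they are independent and could be evaluated concurrently.

```python
def reduce_dc(g, x):
    """Reduce a non-empty sequence by splitting it in half.

    The result equals the left-to-right reduction only if g is
    associative; the recursion depth is about log2(len(x)).
    """
    if len(x) == 1:
        return x[0]
    mid = len(x) // 2
    left = reduce_dc(g, x[:mid])     # independent of the right half
    right = reduce_dc(g, x[mid:])
    return g(left, right)


# String concatenation is associative (though not commutative).
joined = reduce_dc(lambda a, b: a + b, ["a", "b", "c", "d", "e"])

# Subtraction is not associative: the tree-shaped grouping
# (10 - 1) - (2 - 3) differs from ((10 - 1) - 2) - 3 == 4.
regrouped = reduce_dc(lambda a, b: a - b, [10, 1, 2, 3])
```

The second call shows why the associativity assumption matters: with a non-associative operator the regrouping changes the answer.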
7. Physical Configuration
As mentioned earlier, Rediflow currently assumes a
configuration in which a number of processor-memory
pairs are interconnected via a switching network. The
combination of such a pair with an appropriate packet
switch for information transfer will be called an Xputer, a
primitive sketch of which is shown in Figure 7-1.
Figure 7-1: Sketch of an Xputer
(Diagram labels: switch, with packets to/from neighbors; processor; memory.)
The exact form of the Xputer network is not too important
in this exposition. For a small number of nodes, say up to
a few hundred, a rectangular grid interconnection should
be adequate (see Figure 7-2). Input/output devices, which
are not shown, may be attached at any nodes. For larger
numbers of nodes, an interconnection topology with a
lower worst-case delay is attractive. The concepts
expressed in this paper can be used in such a system
without modification.
Figure 7-2: Sketch of an Xputer network
(Diagram legend: each node symbol is one Xputer.)
We can think of an Xputer grid as forming a plane surface,
with the switches, processors, and memories each forming
logically parallel layers. The layers need not be physically
parallel. Interconnection exists only at the switch layer,
while the memories in the memory layer have a combined
globally addressable address space. If one Xputer needs
to access the memory of another, it forms a request
packet containing the address to be accessed. That
packet is then routed within the switch layer to the Xputer
containing the addressed location. A result packet is then
formed, which is then routed to the requesting Xputer.
This request/return mechanism is integrated with the
demand-driven mechanism of reduction evaluation, so that
remote triggering of function evaluations can take place.
8. Load distribution and balancing
The refusal to rely upon a centralized queue for the
distribution of work load means that other distribution
methods must be used. As stated earlier, the smaller the
grain, the more effective load balancing can be made. To
avoid granularity so small that communication delays
become significant, we aim at "medium" granularity. The
approach taken in Rediflow is based on the contention
that, ideally, grains behave as molecules of fluid being
poured over a "surface" of processor-memory pairs. The
reduction model enables this granularity, but we still need
a means of making the fluid model work. We use the
analogy of pressure to explain how this is done.
As with most multiprocessor organizations, queues are
used to hold the backlog of work. In our case, the items
on these queues are called chares (small tasks). Among
several other queues to be described, each Xputer has a
queue called the apply queue which is the reservoir of
migrable chares. Chares on this queue represent
function-instances which may be done on any available
Xputer, due to the granularity and addressability
assumptions stated earlier. Each such chare carries a
"closure" which points to both a block representing the
code of the function and a tuple of "imported" values, in
addition to the actual argument, which may also be a
tuple. Pure copies of such code blocks are cached locally
in an Xputer, following an initial fetch from secondary or
resident storage. There is no a priori correspondence
between logical code and physical Xputer, and the same
function may be executed in many different Xputers.
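The shape of a chare can be pictured with a small Python sketch. The field and class names (Closure, Chare, code, imports, argument) are hypothetical, a Python callable stands in for the pointer to a code block, and the calling convention of passing the imported tuple as the first parameter is an assumption made for the illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Closure:
    code: Callable[[Tuple[Any, ...], Any], Any]  # stands in for the code block
    imports: Tuple[Any, ...]                      # values of imported variables


@dataclass(frozen=True)
class Chare:
    closure: Closure
    argument: Any                                 # may itself be a tuple

    def run(self) -> Any:
        """Apply the closure's code to its imported values and the argument."""
        return self.closure.code(self.closure.imports, self.argument)


def scale_and_shift(imports, x):
    """Example function body: scale and shift are imported values."""
    scale, shift = imports
    return scale * x + shift


# A per-Xputer apply queue holding chares that are free to migrate.
apply_queue = deque()
apply_queue.append(Chare(Closure(scale_and_shift, (2, 1)), 5))
result = apply_queue.popleft().run()   # 2 * 5 + 1
```

Because a chare carries everything it needs by reference or by value, nothing in it ties it to the Xputer on which it was created.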
The number of chares on an Xputer's apply queue,
weighted together with other resource utilization measures
such as memory usage, can be thought of as defining its
internal pressure. For the moment, assume that the
Xputer can sense not only this pressure, but also the
pressures of its neighbors, some function of which is
called the external pressure. When the internal pressure
sufficiently exceeds the external, some chares from the
apply queue may issue forth into the interconnection
network, where they are distributed to Xputers with lower
pressures. In fact, Rediflow employs a
moderately-intelligent switch, which is capable of directing chares
along pressure gradients to find such low points. When a
chare reaches an Xputer with a local pressure minimum, it
is absorbed into its apply queue. This tends to raise the
pressure of that Xputer, and lessen the likelihood that it
will receive more chares, until its internal pressure
becomes lower due to completion of work.
The phenomenon of saturation occurs when all Xputers
are sufficiently busy that any attempt to migrate
apply-chares would be futile, despite pressure differentials. An
additional aspect of the Rediflow load balancing
mechanism is the detection of such saturation. When
external pressure is sufficiently high, migration attempts
cease.
Obviously, pressure of Xputers is continually changing.
Accordingly, it is necessary to continually update each
Xputer's sense of its environmental pressure. This is done
through a sampling process, in which the switch of each
Xputer computes transmitted pressure as a function of
transmitted pressures of its neighbors. One heuristic
which seems to work well is to define the transmitted
pressure to be 0 if the Xputer's internal pressure is below
a certain threshold, and 1 + the minimum of the
neighbors' transmitted pressures otherwise, with an
absolute maximum on the order of the diameter of the
network. This has the desired effect of permitting chares
to flow toward the least loaded node.
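The heuristic can be tried out with a short Python sketch. Everything beyond the rule itself is an assumption: the function names, the synchronous rounds of updating (the text describes a continual sampling process), the dictionary representation of the network, and the five-node line used as an example. Under these assumptions the transmitted pressure of a node settles at its hop distance to the nearest lightly loaded node, capped at the maximum.

```python
def transmitted_pressures(internal, neighbors, threshold, cap):
    """Repeat the rule in synchronous rounds until the values stop changing.

    internal:  node -> internal pressure
    neighbors: node -> list of adjacent nodes
    cap:       absolute maximum, on the order of the network diameter
    """
    t = {node: 0 for node in internal}
    for _ in range(cap + 1):
        updated = {}
        for node in internal:
            if internal[node] < threshold:
                updated[node] = 0
            else:
                lowest = min(t[m] for m in neighbors[node])
                updated[node] = min(cap, 1 + lowest)
        if updated == t:
            break
        t = updated
    return t


def route(start, neighbors, t):
    """Follow strictly decreasing transmitted pressure to a local minimum."""
    path = [start]
    while True:
        here = path[-1]
        best = min(neighbors[here], key=lambda m: t[m])
        if t[best] >= t[here]:
            return path          # the chare is absorbed here
        path.append(best)


# Five Xputers in a line; only node 3 is below the load threshold.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
load = {0: 9, 1: 9, 2: 9, 3: 1, 4: 9}
pressure = transmitted_pressures(load, line, threshold=5, cap=4)
path = route(0, line, pressure)
```

When every node is above the threshold, all transmitted pressures rise to the cap, there is no gradient to follow, and route returns immediately, which corresponds to the saturation condition described earlier.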
9. Throttling
As mentioned earlier, an advantage of the reduction model
of computation is that concurrently-executable work is
easily spawned for migration to other processors. In
effect, a "tree" is grown which corresponds to a single
expression from which the "output" of the running
program is extracted on a continuing basis. The default
mode of servicing each Xputer's apply queue is FIFO,
which traverses the tree breadth first and thus has the
virtue of reaching concurrently executable nodes earlier.
However, when saturation conditions exist, an Xputer
switches to LIFO to give depth first traversal, in order to
throttle its rate of chare production. This is helpful for
avoiding queue overflows and for reducing the possibility
of over-commitment of memory space, which could result
in a kind of deadlock. (This suggestion was also made in
[3].) In saturated mode, operators which would normally
demand arguments concurrently are changed to demand
them sequentially. This is easy to do within the reduction
model. Finally, certain eagerness operators are normally
compiled into the reduction code to cause anticipatory
demands to components of suspended data structures [8]
for added concurrency [17]. Eagerness operators are
ignored in saturated mode.
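The effect of the queue discipline on backlog can be seen in a toy Python model. The model is an assumption made for illustration: a single Xputer with no migration, a complete binary tree of chares in which every chare above the leaf level spawns two children, and a chare represented only by its depth in the tree.

```python
from collections import deque


def peak_queue_length(depth, saturated):
    """Largest apply-queue length seen while expanding a binary tree.

    saturated=False services the queue FIFO (breadth first);
    saturated=True services it LIFO (depth first).
    """
    queue = deque([0])           # start with the root chare, at level 0
    peak = 1
    while queue:
        level = queue.pop() if saturated else queue.popleft()
        if level < depth:        # interior chare: spawn two children
            queue.append(level + 1)
            queue.append(level + 1)
        peak = max(peak, len(queue))
    return peak


fifo_peak = peak_queue_length(4, saturated=False)   # grows like 2**depth
lifo_peak = peak_queue_length(4, saturated=True)    # grows like depth + 1
```

In this model FIFO service exposes all 16 leaves at once, which is desirable when other Xputers are idle, while LIFO service never holds more than 5 chares, which is the throttling behavior wanted under saturation.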
10. Garbage collection
Distributing addressable memory across many modules, as
is done in Rediflow, necessitates a distributed garbage
collector. Although there are several candidates which
suggest themselves, our current approach is to use a
copying garbage collector [2]. In our distributed variation
of this approach, the entire address space is divided in
half, which appears as a halving of the memories in each
Xputer. Allocation takes place within each Xputer from
successive locations of its half-space. (Incidentally, the
occupation level of this half-space contributes to the
Xputer's internal pressure.) When all space is used up,
accessible records are copied to the other half-spaces in
the same Xputers. In this way, the distribution of data
which is necessary for concurrent processing is
maintained.
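The copying step within one Xputer can be illustrated with a generic two-space copying collector in Python. This is a textbook-style sketch and not necessarily the algorithm of [2] or of the Rediflow simulator: the names Ref and collect are hypothetical, records are lists of fields, only references within a single Xputer are handled, and references that cross Xputers (which the distributed scheme must deal with) are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Ref:
    """A pointer field: the index of a record in the current half-space."""
    addr: int


Field = Union[int, Ref]
Space = List[List[Field]]


def collect(from_space: Space, roots: List[Ref]) -> Tuple[Space, List[Ref]]:
    """Copy the records reachable from roots into a fresh half-space.

    Returns the new half-space and the roots rewritten to new addresses.
    """
    to_space: Space = []
    forward: Dict[int, int] = {}      # old address -> new address

    def copy(ref: Ref) -> Ref:
        if ref.addr not in forward:
            forward[ref.addr] = len(to_space)   # next successive location
            to_space.append(list(from_space[ref.addr]))
        return Ref(forward[ref.addr])

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):       # fix up pointers in copied records
        record = to_space[scan]
        for i, f in enumerate(record):
            if isinstance(f, Ref):
                record[i] = copy(f)
        scan += 1
    return to_space, new_roots


# Records 0 and 2 point at each other; records 1 and 3 are unreachable.
old_half = [[1, Ref(2)], [99], [2, Ref(0)], [7, Ref(1)]]
new_half, new_roots = collect(old_half, [Ref(0)])
```

Copied records land in successive locations of the new half-space, so allocation can continue from the end of the copied data, and the unreachable records are simply left behind.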
Distributed copying can be achieved by packets, enabling
all Xputers to have an active role in garbage collection
concurrently. (This packet-oriented implementation is not
present in our current simulator.) A related approach is
used in Halstead's "Concert" [11], although he uses a
"real-time" collector which tries to compact to one Xputer,
rather than preserving the data distribution. Our approach
can be converted to a real-time one, using methods
similar to those described in [12]; however, we have not
yet worked out these details.
Another aspect of garbage collection to be exploited
concerns load balancing when von Neumann processes are
involved. Due to their constantly regenerating nature, the
latter are typically much larger-grained than functions
evaluated by pure reduction, and therefore somewhat of a
hindrance to dynamic load-balancing. Nonetheless, their
contribution to Xputer pressure can be assessed by the
presence of their components on internal queues, and they
can be shifted from one Xputer to another with reasonable
ease during garbage collection, when addresses are
remapped anyway. The simulation of this form of
balancing is currently not done.
11. General Packet Flow
A rough overview of the organization of an Xputer as
explained above may be found in Figure 11-1. This
diagram assumes that pressure sampling information is
sent through the switching layer in the form of packets,
which are intermingled with packets of other varieties
(containing apply chares, data requests and responses, and
garbage collection messages). This assumption has been
used in most of our simulation results so far. However, it
is also possible to dedicate a separate serial channel to
the transmission of pressure information.
A fetch packet is issued by an Xputer which needs to get
a datum from a location in another Xputer. It contains the
address of the datum, and a return address. When a fetch
packet arrives at the other Xputer's in-queue, the location
is checked for valid data, and if so, a forward
packet is created which returns the value to the first
Xputer. However, it may be that the data have not yet
been produced, in which case the return address is
reserved in the second Xputer until such time as the
data are available. Also, if production of the data has not
yet been demanded, it will be demanded at that time.
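The fetch and forward handling just described can be sketched as follows. This is a minimal illustration only: the names `Location`, `Xputer`, `on_fetch`, `on_produced`, and the `send` callback are hypothetical, and the paper does not specify the actual data structures used.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical representation: (requesting Xputer id, local address there).
ReturnAddress = Tuple[int, int]


@dataclass
class Location:
    value: Any = None
    valid: bool = False       # True once the datum has been produced
    demanded: bool = False    # True once production has been demanded
    waiting: List[ReturnAddress] = field(default_factory=list)


class Xputer:
    def __init__(self, ident: int, send: Callable[[int, tuple], None]):
        self.ident = ident
        self.send = send                  # assumed hook into the switching layer
        self.memory: Dict[int, Location] = {}
        self.demands: List[int] = []      # addresses whose production was demanded

    def on_fetch(self, address: int, return_address: ReturnAddress) -> None:
        """Handle a fetch packet taken from the in-queue."""
        loc = self.memory.setdefault(address, Location())
        if loc.valid:
            # Data already present: answer at once with a forward packet.
            self._forward(loc.value, return_address)
            return
        # Data not yet produced: reserve the return address here.
        loc.waiting.append(return_address)
        if not loc.demanded:
            # Production has not been demanded yet, so demand it now.
            loc.demanded = True
            self.demands.append(address)

    def on_produced(self, address: int, value: Any) -> None:
        """Store a newly produced value and answer every reserved requester."""
        loc = self.memory.setdefault(address, Location())
        loc.value, loc.valid = value, True
        waiting, loc.waiting = loc.waiting, []
        for return_address in waiting:
            self._forward(value, return_address)

    def _forward(self, value: Any, return_address: ReturnAddress) -> None:
        dest, dest_addr = return_address
        self.send(dest, ("forward", dest_addr, value))
```

In this sketch a second fetch for the same unproduced location only adds another reserved return address; the demand is issued once.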
Because all result data have pre-allocated globally-
addressable locations, it is not necessary to use any form
of "token matching" [28] to get the data to their
destinations. Thus, fast von Neumann-style memory is
internally exploited in each Xputer. The use of addressing
also permits routing tables to select the shortest
possible route through the switching layer.
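The paper does not say how such routing tables are constructed. One possible construction, assuming every switch-to-switch hop costs the same and that the neighbor graph is known, is a breadth-first search from each Xputer that records the first hop of a shortest path to every destination. The function name and the adjacency-dictionary format below are hypothetical.

```python
from collections import deque


def build_routing_table(neighbors, source):
    """Map each reachable destination to the first hop on a shortest path.

    `neighbors` is an assumed adjacency dictionary: node -> list of
    directly connected nodes. All hops are assumed to cost the same.
    """
    next_hop = {}
    seen = {source}
    queue = deque()
    # Direct neighbors are reached through themselves.
    for n in neighbors[source]:
        if n not in seen:
            seen.add(n)
            next_hop[n] = n
            queue.append(n)
    # Farther nodes inherit the first hop of the node that discovered them.
    while queue:
        node = queue.popleft()
        for n in neighbors[node]:
            if n not in seen:
                seen.add(n)
                next_hop[n] = next_hop[node]
                queue.append(n)
    return next_hop
```

A packet addressed to Xputer `d` would then leave through `next_hop[d]`, with each intermediate Xputer consulting its own table in the same way.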
[Figure 11-1 diagram: load manager, apply processor, and local memory, connected by FIFO queues that carry pressure, apply, fetch, and forward packets to and from neighbors.]
Figure 11-1: Packet flow within a Rediflow Xputer
12. Performance evaluation
The performance of the Rediflow architecture is being
evaluated using simulation. As with most studies in their
formative stages, we have begun evaluating speedups
using an introspective model, i.e. one in which speedups
are measured against a single processor with the same
technological assumptions, architecture, and evaluation
model as the multiprocessor. Due to certain needed
improvements in our model, we are not yet ready to begin
challenging existing sequential processors for applications
with low degrees of concurrency. However, if the
potential concurrency is high, then we believe Rediflow
can exploit it with a demonstrated speedup.
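The introspective speedup measure can be written down as a short calculation. This is a sketch under stated assumptions: both runs are taken to report elapsed time in the same simulated units, and the function names and the efficiency ratio are illustrative additions, not quantities defined in the paper.

```python
def speedup(single_xputer_time, multi_xputer_time):
    """Introspective speedup: the same program, run on one Xputer and on many.

    Both times are assumed to come from the same simulator, with the same
    technological assumptions, and to be in the same units.
    """
    if single_xputer_time <= 0 or multi_xputer_time <= 0:
        raise ValueError("times must be positive")
    return single_xputer_time / multi_xputer_time


def efficiency(speedup_value, xputer_count):
    """Speedup per Xputer; 1.0 would mean perfectly linear scaling."""
    if xputer_count <= 0:
        raise ValueError("xputer_count must be positive")
    return speedup_value / xputer_count
```

For example, with arbitrary illustrative inputs, `speedup(1200, 300)` gives 4.0, and `efficiency(4.0, 16)` gives 0.25.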
We have been running two kinds of benchmarks. One
consists of "toy" programs which exhibit a single kind of
activity, such as pure "divide and conquer". The other
consists of more "realistic" applications which combine a
number of activities, in the areas of simple database
searching and updating, and correlative signal processing.
To briefly summarize, we have measured speedups in the
range of 1 to 8 for the realistic applications, with fewer
than 32 Xputers, and of up to 30 with the toy programs
with as many as 128 Xputers. Memory space in our
simulator is currently a principal limiting factor. In the
process, we have demonstrated that the load distribution
techniques designed for Rediflow apparently work well.
They do exploit locality, in that typically over 50% of the
data packets, and 80% of the apply packets, traverse paths
of length at most 2. Usually fewer than 15% of all
operations performed need to communicate outside one
Xputer. We have also observed that the switches
hypothesized for Rediflow do not seem to be a bottleneck
under current technological assumptions.
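The locality figures above are fractions of packets whose paths stay within a given length. A minimal sketch of how such a fraction could be computed from a simulator trace follows; the function name and the trace format (a list of per-packet hop counts) are assumptions, not the authors' instrumentation.

```python
def fraction_within(path_lengths, max_hops=2):
    """Fraction of packets whose path length is at most `max_hops`.

    `path_lengths` is an assumed trace: one hop count per packet.
    """
    if not path_lengths:
        raise ValueError("no packets recorded")
    short = sum(1 for hops in path_lengths if hops <= max_hops)
    return short / len(path_lengths)
```

Applied separately to data packets and apply packets, this would yield the two percentages of the kind quoted above.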
13. Future Work
In addition to continuing our on-going evaluation and
improvement of the basic Rediflow system, we are
widening the investigation of application areas. For
example, we and colleagues are in the process of
including means of concurrently evaluating logic programs
(cf. [22, 25]) .
We also intend to engage in studies of reliability. An
added feature of the mathematical model underlying
functional evaluation is that data are never destroyed,
making such a model a natural candidate for expressing a
recovery model [18, 21]. This, coupled with our contention
that physical configuration is apt to be more gracefully
degradable than a dancehall configuration, makes Rediflow
an attractive candidate for a reliability investigation. We
hope to prove this, and other concepts discussed, through
one or more physical multiprocessor realizations in the
next few years.
14. Conclusions
We have presented a collection of ideas being integrated
into a multiprocessing system called Rediflow, which
employs a packet-switching network to implement higher-
level programming abstractions aimed at efficiently
running medium-grained applications with high degrees of
concurrency. We have discussed a technique for load-
balancing in an essentially-distributed system. Finally, we
have explained preliminary results on performance of
Rediflow.
References
[1] J. Backus. Can programming be liberated from the
von Neumann style? A functional style and its algebra of
programs. Communications of the ACM 21(8):613-641,
August, 1978.
[2] H.G. Baker, Jr. List processing in real time on a
serial computer. Communications of the ACM
21(4):280-293, April, 1978.
[3] F.W. Burton and M.R. Sleep. Executing functional
programs on a virtual tree of processors. In Functional
programming languages and computer architecture, pages
187-195. October, 1981.
[4] A. Church. The calculi of lambda-conversion.
Princeton University Press, 1941.
[5] A.L. Davis and R.M. Keller. Dataflow program graphs.
IEEE Computer 15(2):26-41, February, 1982.
[6] J.B. Dennis. Data flow supercomputers. IEEE
Computer 13(11):48-56, November, 1980.
[7] M. DuBois and Faye A. Briggs. Effects of cache
coherency in multiprocessors. IEEETC C-31(11):1083-1099,
November, 1982.
[8] D.P. Friedman and D.S. Wise. CONS should not
evaluate its arguments. In Michaelson and Milner (editors),
Automata, Languages, and Programming, pages 257-284.
Edinburgh University Press, 1976.
[9] E.F. Gehringer, A.K. Jones, and Z.Z. Segall. The Cm*
testbed. Computer 15(10):40-49, October, 1982.
[10] A. Gottlieb, et al. The NYU Ultracomputer-Designing
an MIMD shared memory parallel computer. IEEETC
C-32(2):175-189, February, 1983.
[11] Robert Halstead. Private communication, MIT, 1983.
[12] P. Hudak and R.M. Keller. Garbage collection and
task deletion in distributed applicative processing systems.
In Proc. Conf. on Lisp and Functional Programming, pages
168-178. ACM, August, 1982.
[13] G. Kahn. The semantics of a simple language for
parallel programming. In Information Processing 74, pages
471-475. IFIPS, North Holland, 1974.
[14] R.M. Keller, G. Lindstrom, and S. Patil. A loosely-
coupled applicative multi-processing system. In AFIPS
Conference Proceedings, pages 613-622. June, 1979.
[15] R.M. Keller and G. Lindstrom. Hierarchical analysis of
a distributed evaluator. In Proc. International Conference
on Parallel Processing, pages 299-310. August, 1980.
[16] R.M. Keller. Divide and CONCer: Data structuring for
applicative multiprocessing. In Proc. 1980 Lisp Conference,
pages 196-202. August, 1980.
[17] R.M. Keller and G. Lindstrom. Applications of
feedback in functional programming. In Conference on
functional languages and computer architecture, pages
123-130. October, 1981.
[18] R.M. Keller and G. Lindstrom. Approaching
Distributed Database Implementations through Functional
Programming Concepts. Technical Report, University of
Utah, Department of Computer Science, 1982.
[19] R.M. Keller. FEL (Function Equation Language)
Programmer's guide. 1982. AMPS Technical Memorandum
No. 7.
[20] W.E. Kluge. Cooperating reduction machines. To
appear in IEEETC, 1983.
[21] Frank C.H. Lin. A distributed load balancing
mechanism for applicative systems. December, 1983. PhD
Thesis Proposal, Department of Computer Science,
University of Utah.
[22] G. Lindstrom and P. Panangaden. Stream-Based
Execution of Logic Programs. In Proc. 1984 Int'l. Symp. on
Logic Programming. February, 1984. (to appear).
[23] G.A. Mago. A Network of Microprocessors to
Execute Reduction Languages, Part I. International Journal
of Computer and Information Sciences 8(5):349-385,
March, 1979.
[24] C.V. Ravishankar and J.R. Goodman. Cache
implementation for multiple microprocessors. In Compcon
'83, pages 346-350. IEEE, March, 1983.
[25] U.S. Reddy. Transforming Logic Programs into
Functional Programs. In Proc. 1984 Int'l. Symp. on Logic
Programming. February, 1984. (to appear).
[26] J. Tanaka. Optimized concurrent execution of an
applicative language. PhD thesis, University of Utah,
Department of Computer Science, December, 1983.
[27] D.A. Turner. A new implementation technique for
applicative languages. Software - Practice and Experience
9:31-49, 1979.
[28] I. Watson and J. Gurd. A practical data flow computer.
IEEE Computer 15(2):51-57, February, 1982.
