The Gradient Model Load Balancing Method by Lin, Frank C. H. & Keller, Robert M.
1-1-1987
The Gradient Model Load Balancing Method
Frank C. H. Lin
Robert M. Keller
Harvey Mudd College
Recommended Citation
Lin, Frank C.H., and Robert M. Keller. "The Gradient Model Load Balancing Method." IEEE Transactions on Software Engineering
13.1 ( January 1987): 32-38. DOI: 10.1109/TSE.1987.232563
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-13, NO. 1, JANUARY 1987
The Gradient Model Load Balancing Method
FRANK C. H. LIN AND ROBERT M. KELLER, MEMBER, IEEE
Abstract: A dynamic load balancing method is proposed for a class
of large-diameter multiprocessor systems. The method is based on the
"gradient model," which entails transferring backlogged tasks to
nearby idle processors according to a pressure gradient indirectly es-
tablished by requests from idle processors. The algorithm is fully dis-
tributed and asynchronous. Global balance is achieved by successive
refinements of many localized balances. The gradient model is formu-
lated so as to be independent of system topology.
Index Terms: Applicative systems, computer architecture, data flow,
distributed systems, load balancing, multiprocessor systems, reduction
architecture.
I. INTRODUCTION
Load balancing enables a multiprocessor system to
distribute tasks effectively to various processing nodes
such that the aggregate throughput of the system is max-
imized. Throughput is normally measured by the system
response time, but it is difficult to use this a posteriori
measure to improve performance dynamically. Hence,
many load balancing studies rely on a secondary measure,
that of processor utilization, to govern load-balancing.
The intuitive idea is that if processor utilization can be
increased without undue overhead, then response time will
be improved.
Conceptually, a load balancing algorithm implements a
mapping function from tasks to processors. A static map-
ping may be exercised during program compilation or
loading [6], [1], [5], [17], [7], [18]. Dynamic balancing
[15], [14], [17], on the other hand, deals with decisions
relating to the mapping of tasks during the computation
itself. The effectiveness of a static balancing method
hinges on the accuracy of the assignment function whereas
the effectiveness of a dynamic balancing depends on the
efficiency of the task migration techniques.
This study focuses on dynamic load balancing issues of
loosely coupled, large-scale applicative systems [8], [1],
[10], [11], [20]. A program is started by generating a task
at one processor. For a parallel application, this task will
"spawn" additional tasks, which in turn continue spawn-
ing to build up a work backlog, requiring further dispersal
of the work load.
Section II mentions related load balancing strategies.
The gradient model for load migration and balancing is
described in Section III. Variations of the gradient model
in heterogeneous systems are also presented. An implementation
of the proposed model on an applicative multiprocessor
system is described in Section IV, followed
by the presentation of simulation results.
Manuscript received January 31, 1986; revised June 16, 1986.
F. C. H. Lin is with ESL Inc., 495 Java Drive, Sunnyvale, CA 94088.
R. M. Keller was with the Department of Computer Science, University
of Utah, Salt Lake City, UT 84112. He is now with Quintus Computer
Systems, Inc., Mountain View, CA 94041.
IEEE Log Number 8611359.
II. RELATED RESEARCH
Kratzer [14] suggested a swap-bid protocol for distrib-
uted load balancing. When a processor receives a "status
update message," it finds the best possible task move-
ment to/from another node or a swap of tasks with respect
to a performance estimation heuristic.
Load balancing in the Purdue Engineering Computer
Network [7] is implemented by a deterministic balancing
policy. A load average, which represents the degree of
idleness, is maintained in each network machine's kernel.
Before a command is processed, the load averages from
every network machine are obtained and the one with the
minimum load average is chosen.
Ni [15] proposed a load balancing method for a small
scale point-to-point multiprocessor system. An idle pro-
cessor sends a request message to its neighbors. The
neighbors respond with a busy or not-busy status indica-
tion. The idle processor then selects a target neighbor and
sends it a draft message. The target processor may either
respond with a new task or respond with a too-late mes-
sage if the new task has been drafted by another proces-
sor. Tasks can be migrated at most one hop away from
the originating host. A modified version of the draft pro-
tocol was recently published [16].
Stankovic [19] suggested using an expert system for
heuristic scheduling algorithms. This is a broad and am-
bitious approach and no details of such an expert system
have been described up to now.
Turning to applicative systems, the data flow machine
proposed by Dennis and Misunas [4] used a round-robin centralized scheduler
to arbitrate operation packets to processing units. The
AMPS project [8] employed a tree structure which recur-
sively balanced tasks onto the processor tree. Load bal-
ancing was handled by the nonleaf processors, with tasks
shifted from one subtree to another in order to reduce load
differential between adjacent subtrees.
Gostelow [6] proposed a token-ring network where each
node had four processing elements and one shared local
memory. New tasks were mapped onto processors by a
system-wide hash function. Several hash functions were
studied and it was concluded that system performance in-
creased if program locality could be enforced by the hash
function.
0098-5589;87/0100-0032$01.00 © 1987 IEEE
LIN AND KELLER: GRADIENT MODEL LOAD BALANCING METHOD
The ZAPP system [1] attempted to match the creation
of new tasks to available processors while executing a di-
vide and conquer algorithm. This too used the idea of
stealing a task from a neighbor for load balancing.
III. THE GRADIENT MODEL
The gradient model is a localized load balancing method
where every processor interacts only with its immediate
neighbors. A global balancing is achieved by propagation
and successive refinement of local load information. An
idle or underutilized processor initiates the load balancing
activities by demanding more work load. The demand is
indirectly relayed through the system in a manner to be
described. A demand is fulfilled by the arrival of a task
or tasks from other heavily loaded processors.
A. Gradient Surface
The gradient model employs a two-tiered load balanc-
ing algorithm. The first step is to let each individual pro-
cessor determine its own loading condition. The time-
varying load state of a processor may be light, moderate,
or heavy. Colloquially, if a processor is light, it wishes
to have more load given to it. If it is heavy, it wishes to
get rid of some of its present load. If neither of these con-
ditions holds, then it is moderate.
Definition: The distance $d_{i,j}$ between two processors $i$
and $j$ of a multiprocessor network is the length of the
shortest path between $i$ and $j$. The diameter of a multiprocessor
network $N$ is the maximum distance between any
two nodes of $N$, i.e.,
$$\mathrm{diameter}(N) = \max \{ d_{i,j} : i, j \in N \}$$
Definition: The gate of a processor $i$ is a binary function
$g_i$. A gate is open if the node is lightly loaded. Otherwise
it is closed. In the gradient model, $g_i$ is defined as:
$$g_i = \begin{cases} 0 & \text{if gate } i \text{ is open} \\ w_{\max} & \text{if gate } i \text{ is closed} \end{cases}$$
where $w_{\max} = \mathrm{diameter}(N) + 1$.
Intuitively, a processor welcomes the influx of new
tasks by opening its gate. The second step of the gradient
model load balancing method is to establish a system-wide
gradient surface to facilitate task migrations. The gradient
surface is represented by the aggregate value of all prox-
imities.
Definition: The proximity of a processor $i$, $w_i$, is the
minimum distance between the processor and a lightly
loaded node in the system. If there is no light node in a
system, $w_i$ is defined as $w_{\max}$. This means,
$$w_i = \begin{cases} \min \{ d_{i,k} : g_k = 0 \} & \text{if there exists a } k \text{ such that } g_k = 0 \\ w_{\max} & \text{if } g_k = w_{\max} \text{ for all } k \end{cases}$$
The proximity of a light node is zero. The proximity of
its immediate neighbors is one, indicating that these nodes
are one hop away from a light processor. The proximity
of the neighbor's neighbor is two, etc. A system without
any light node can be considered as a system with a light
node at a distance beyond the diameter of the network.
Fig. 1. Proximity distribution and gradient surface.
Definition: The gradient surface (GS) of a network is
the collection of proximities of all processors:
$$GS = [w_1, w_2, w_3, \ldots, w_n]$$
Given a neighboring relation and a gate distribution,
the gradient surface of a system is stable and determinate.
As an example, Fig. 1 depicts a system with a 4 x 4
rectangular configuration. Assume nodes 5 and 13 are
lightly loaded; the proximity function of each processor
is shown in italic. These values comprise a gradient sur-
face.
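The construction of a gradient surface can be sketched as a breadth-first search started from all light nodes at once. This is only an illustration under stated assumptions: the helper names `grid_neighbors` and `gradient_surface` are hypothetical, and nodes are numbered 0 to 15 row by row, which may not match the numbering used in Fig. 1.

```python
from collections import deque


def grid_neighbors(rows, cols):
    """Adjacency lists of a non-wrapped rows x cols grid (assumed row-major numbering from 0)."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            adj[r * cols + c] = [
                rr * cols + cc
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols
            ]
    return adj


def gradient_surface(adj, light_nodes, w_max):
    """Proximity w_i of every node: hops to the nearest light node, or w_max if there is none."""
    proximity = {node: w_max for node in adj}
    queue = deque()
    for node in light_nodes:
        proximity[node] = 0
        queue.append(node)
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            # Real distances never exceed the diameter, so w_max marks "not reached yet".
            if proximity[nb] == w_max:
                proximity[nb] = proximity[node] + 1
                queue.append(nb)
    return proximity


adj = grid_neighbors(4, 4)
# A 4 x 4 non-wrapped grid has diameter 6, so w_max = 7.
surface = gradient_surface(adj, light_nodes=[5, 13], w_max=7)
for r in range(4):
    print([surface[r * 4 + c] for c in range(4)])
```

This global search needs the gate values of all processors, which is exactly the information a distributed system does not have at hand.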
B. Gradient Surface Approximation
The gradient surface has multiple attributes. First, it is
a network-wide indication of all under-utilized proces-
sors. Second, it carries an implicit request for work load.
Third, it serves as a minimum distance routing pointer for
directing unprocessed tasks. However, formulation of the
gradient surface requires the knowledge of all proximi-
ties. Accurate calculation of a proximity requires gate val-
ues of all processors, which are not readily available in a
distributed environment. In this section, we suggest a dis-
tributed measurement, termed propagated pressure, to
approximate the proximity function.
Definition: The propagated pressure of a processor $i$, $p_i$,
is defined by the following equation.
$$p_i = \min \{ g_i,\ 1 + \min \{ p_j : d_{i,j} = 1 \} \}$$
A lightly loaded processor has propagated pressure of
zero. Propagated pressure of a moderate or heavy processor
is computed by adding one to the minimum propagated
pressure of neighbors. Since $g_i \le w_{\max}$, $w_{\max}$ is
also the upper bound of propagated pressures.
Definition: The pressure surface (PS) of a network is
the collection of propagated pressures of all processors:
$$PS = [p_1, p_2, p_3, \ldots, p_n]$$
Definition: A pressure surface is apparently stable if
the last value of each propagated pressure is equal to its
newly computed value.
Theorem: When an apparently stable pressure surface
is reached, $p_i = w_i$ for every processor $i$.
Proof: If there is no light node in the system, by definition,
$g_i = w_i = w_{\max}$ for all $i$, so
$$p_i = \min \{ w_{\max},\ 1 + \min \{ w_{\max}, w_{\max}, \ldots \} \} = \min \{ w_{\max},\ 1 + w_{\max} \} = w_{\max} = w_i.$$
If there exists a light node $k$ in the system, by definition,
$w_k = 0$. If node $i$ is distance $n$ away from node $k$, $w_i = n$.
Case i: $n = 0$, i.e., $k = i$.
$$p_i = \min \{ 0,\ \text{some positive number} \} = 0$$
$$w_i = \min \{ d_{i,i} \} = 0 \quad \text{because } g_i = 0$$
Hence $p_i = w_i$.
Case ii: Assume that node $i$ is distance $n$ away from $k$
and $p_i = w_i$. We try to prove that given a node $j$ at distance
$n + 1$, $p_j = w_j$. Without loss of generality, we assume
that nodes $i$ and $x$ are immediate neighbors of $j$ (Fig. 2).
$$p_j = \min \{ g_j,\ 1 + \min \{ p_i, p_x \} \} = \min \{ w_{\max},\ 1 + \min \{ w_i, p_x \} \} = 1 + \min \{ w_i, p_x \} \quad \text{since } 1 + w_i \le w_{\max}$$
In an apparently stable pressure surface, one of the following
three conditions holds:
(1) $p_x = 1 + p_j$; since $p_j = 1 + p_i = 1 + w_i$, $p_x = 1 + 1 + w_i$; hence $p_x > w_i$.
(2) $p_x = p_j$; $p_j = 1 + w_i$; hence $p_x > w_i$.
(3) $p_x = p_j - 1$; $p_x = 1 + w_i - 1$; hence $p_x = w_i$.
In any event, $p_x \ge w_i$.
Therefore, $p_j = 1 + \min \{ w_i, p_x \} = 1 + w_i = w_j$. Q.E.D.
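The theorem can be checked numerically on a small network. The sketch below is an assumption-laden illustration, not the authors' code: `relax_pressures` is a hypothetical name, every pressure is assumed to start at the ceiling $w_{\max}$, and nodes are updated in a fixed sweep order rather than asynchronously. It repeats the propagated pressure equation until the surface is apparently stable.

```python
def relax_pressures(adj, light_nodes, w_max):
    """Repeat p_i = min(g_i, 1 + min neighbour p_j) until no value changes."""
    light = set(light_nodes)
    # Assumed starting point: every node begins at the ceiling w_max.
    pressure = {node: w_max for node in adj}
    changed = True
    while changed:
        changed = False
        for node in adj:
            gate = 0 if node in light else w_max
            new_p = min(gate, 1 + min(pressure[nb] for nb in adj[node]))
            if new_p != pressure[node]:
                pressure[node] = new_p
                changed = True
    return pressure


# Six-node ring: diameter 3, so w_max = 4. Node 0 is the only light node.
n = 6
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
stable = relax_pressures(ring, light_nodes=[0], w_max=4)
print([stable[i] for i in range(n)])  # [0, 1, 2, 3, 2, 1]
```

The stable pressures equal the hop distances to node 0, which are the proximities of the ring. With no light node, every pressure stays at $w_{\max}$.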
The gradient model uses the calculated propagated
pressure to approximate the proximity function. Excess
tasks from heavily loaded nodes are routed to the
neighbor with the least propagated pressure. No ultimate
destination is assigned to a task while it is moving in
the system. The proximate destination of a task is chosen
so that a localized balance is easily achieved.
Ultimate balancing of the system is accomplished through
multiple overlapping local balancings.
Fig. 2. Neighboring relations among k, i, j, and x.
Such a load migration procedure continues until one of
the following conditions is satisfied: 1) the task arrives at
the light node, or 2) some other tasks arrive at the light
node and close the gate. If there exists some other under-
utilized node in the system, a new gradient surface is re-
shaped. The task is then redirected toward the new nearest
light processor.
In mathematical terms, the gradient model load balancing
scheme is a form of relaxation. The single-hop migration
of tasks is a successive approximation method toward
a global balancing of a system.
C. Saturation
After all processors become busy, there is no need to
further balance the load because processors have been
fully utilized. Any load balancing activity during this pe-
riod can only increase system overhead and reduce
throughput.
Definition: A system is saturated if none of the pro-
cessors are lightly loaded. In other words, a system is
saturated if all proximities are equal to $w_{\max}$.
The ceiling value $w_{\max}$ prevents
propagated pressure from growing without bound
in a fully loaded system. A second use of $w_{\max}$ is
to detect saturation and reduce futile task movements. The
saturation state persists until some processor becomes
lightly loaded again. A third use of $w_{\max}$ is in the area
of fault isolation. When a processor fails, its neighbors
can forcibly set the pressure of the failed processor to
$w_{\max}$, which stops the movement of new tasks toward the
faulty node [13].
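The two uses of the ceiling value described above can be sketched in a few lines. Both helper names are hypothetical, and the representation of pressures as a dictionary keyed by node identifier is an assumption made for illustration.

```python
def is_saturated(pressures, w_max):
    """True when every propagated pressure equals the ceiling w_max."""
    return all(p == w_max for p in pressures.values())


def neighbor_view(neighbor_pressures, failed, w_max):
    """Pressures a node would use for routing, with failed neighbours pinned at w_max."""
    return {
        nb: (w_max if nb in failed else p)
        for nb, p in neighbor_pressures.items()
    }


# Neighbour 2 looks attractive (pressure 1) but has failed, so it is masked out.
print(neighbor_view({1: 2, 2: 1}, failed={2}, w_max=7))  # {1: 2, 2: 7}
```

Because a task is only ever sent to the neighbor of least pressure, pinning a failed neighbor at $w_{\max}$ makes it the last choice for any transfer.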
D. Algorithm
Based on the gradient model, a distributed load balanc-
ing algorithm for each node of a system can be devised.
LOOP
    Processor i determines its internal loading state.
    CASE state OF
        light:
            Set p_i = 0.
            Ignore the pressure information from neighbors.
        moderate:
            p_i = 1 + min {p_j} over all neighbors j.
            If p_i > w_max then p_i = w_max (*saturation*)
        heavy:
            p_i = 1 + min {p_j} over all neighbors j.
            If p_i > w_max
                then p_i = w_max (*saturation*)
                else if min {p_j} < p_i,
                    then transfer one task to node j,
                    where p_j is the minimum.
    END CASE
    Broadcast p_i to all neighbors, if p_i has changed
    from last update.
END LOOP.
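One pass through the loop body for a single node can be sketched as a pure function. This is a minimal illustration under assumptions: `node_step` is a hypothetical name, the node has at least one neighbor, ties between neighbors are broken by dictionary order, and the actual task transfer and broadcast are left to the caller.

```python
def node_step(state, neighbor_pressures, w_max):
    """One iteration of the per-node loop.

    state: "light", "moderate", or "heavy" (the node's own assessment).
    neighbor_pressures: dict of neighbour id -> most recent pressure received.
    Returns (own_pressure, transfer_target); transfer_target is None when no task moves.
    """
    if state == "light":
        # A light node opens its gate and ignores its neighbours.
        return 0, None
    target = min(neighbor_pressures, key=neighbor_pressures.get)
    pressure = 1 + neighbor_pressures[target]
    if pressure > w_max:
        # Saturation: clamp to the ceiling and move nothing.
        return w_max, None
    if state == "heavy":
        # Send one task toward the neighbour of least pressure.
        return pressure, target
    return pressure, None


print(node_step("heavy", {1: 2, 2: 0}, w_max=7))  # (1, 2)
```

Returning the decision instead of performing it keeps the sketch independent of any particular message-passing layer.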
The execution of this algorithm is fully asynchronous
and distributed. All processors independently update their
pressure by using the most recent information from the
neighbors. It should be noted that a light node sets its own
pressure to 0 and immediately delivers this information to
the neighbors, since 0 is the minimum proximity. This
allows the lightly loaded processor to trigger the load mi-
gration as soon as possible.
The algorithm is a localized load balancing measure.
The dynamics of local balancing represent a step-wise re-
finement toward a global load balancing. At any instant
there are some subregions of a system adjusting their re-
gional gradient distributions and load balancing their work
load.
E. Heterogeneous Systems
A heterogeneous multicomputer system may be com-
posed of different processor types, varying processing
power, or many kinds of communication links. The gra-
dient model can be enhanced to accommodate this class
of heterogeneous systems.
1) Processor Types: When a multicomputer system has
more than one type of processor, some tasks may have
to be evaluated by a certain kind of processor. A group
addressing scheme is used to facilitate this type of application.
A task with a group address as the destination is treated
as a regular task. When a task is absorbed by a processor,
the node verifies the group address with the destination
identification. The task is reinjected into the system if the
address mismatches. This try and reinject method is costly
if the size of a group is relatively small. The approach
becomes more attractive as the size of the group grows.
One example of using this technique is in a system with
interleaving floating point processors. A compute-bound
task may designate the group address for floating point
processors as the destination. Since the floating point processors
are interleaved with regular processors, the overhead
of absorption and reinjection could be minimal.
Another approach is to formulate multiple gradient surfaces,
one surface for each type of processor. Tasks of
different types are balanced by different gradient surfaces.
In general, multiple load managers within a processor are
needed.
2) Processing Power: The first step of the load bal-
ancing algorithm is for the processor to assess its own
loading condition to be either light, moderate, or heavy.
The load state is devised to be a pure internal measure of
a processor. As a result, the load computation methods of
different nodes need not be identical. A system composed
of processors with different processing power may use different
criteria for determining a processor state. For example,
a node with twice the processing power may require
twice as many tasks to become heavily loaded.
Once a node determines its load state, the gradient
model makes no distinction among different processors.
The decoupling of internal load measurement from net-
work-wide balancing is an essential feature of the gradient
model.
3) Communication Link: The proximity function,
which computes the minimum "length" between two
processors, implicitly assumes identical cost for each
communication link. In a system with heterogeneous
communication capabilities, the proximity function and
the pressure approximation are better served by a cost
function:
$$p_i = \min \{ g_i,\ \min \{ c_{i,j} + p_j : d_{i,j} = 1 \} \}$$
i.e.,
$$p_i = \min \{ g_i,\ c_{i,j} + p_j \ \text{for all } j \text{ with } d_{i,j} = 1 \}$$
where $c_{i,j}$ is the communication cost between nodes $i$ and
$j$. Compared to the earlier pressure definition, it is clear
that homogeneous communication is the special case
where $c_{i,j} = 1$ for all $i$ and $j$.
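The cost-weighted pressure can be relaxed in the same way as the unit-cost version. The sketch below rests on several assumptions: `relax_weighted` is a hypothetical name, link costs are positive and symmetric, pressures start at the ceiling, and the ceiling passed in is simply chosen larger than any path cost, since the text does not say how $w_{\max}$ should be set for non-unit costs.

```python
def relax_weighted(cost, light_nodes, w_max):
    """Repeat p_i = min(g_i, min_j(c_ij + p_j)) until no value changes.

    cost[i][j] is the link cost between neighbours i and j (assumed positive).
    """
    light = set(light_nodes)
    pressure = {node: w_max for node in cost}
    changed = True
    while changed:
        changed = False
        for node, links in cost.items():
            gate = 0 if node in light else w_max
            new_p = min(gate, min(c + pressure[nb] for nb, c in links.items()))
            if new_p != pressure[node]:
                pressure[node] = new_p
                changed = True
    return pressure


# Triangle: the direct link 0-1 is expensive, the detour through node 2 is cheap.
cost = {
    0: {1: 5, 2: 1},
    1: {0: 5, 2: 1},
    2: {0: 1, 1: 1},
}
print(relax_weighted(cost, light_nodes=[0], w_max=10))  # {0: 0, 1: 2, 2: 1}
```

Node 1 ends with pressure 2 rather than 5, so a task leaving it would be routed through node 2, the cheaper path toward the light node.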
IV. AN IMPLEMENTATION
The gradient model load balancing scheme has been
implemented in the simulator for Rediflow, a loosely cou-
pled applicative system proposed by researchers at the
University of Utah [10]-[12]. Each processor is closely
paired with a memory, and a network of packet switches
is used to communicate between these pairs. The combi-
nation of a processor-memory pair and a packet switch
for information transfer is called an Xputer.
A. The Rediflow Simulator
The Rediflow simulator is based on a graph-reduction
model of computation and driven by programs written in
an applicative language, Function Equation Language
(FEL) [9], so named because its expressions are literally
equations describing functions and objects.
The simulator permits the specification of various pa-
rameters, including the number of Xputers, the amount of
memory, the configuration of the Xputers, the switch ca-
pacities, the communication bandwidth, and others. The
loading status of an Xputer is computed as a function of
the backlog of tasks and the amount of memory in use.
The internal load measurement is equated to
number-of-tasks + memprs / (1 - memory-in-use)
where memprs may be specified as a simulator parameter.
In simulations reported here, memprs is set to 0.01.
In the simulator, an Xputer is light if its internal load
measurement falls below a settable low threshold, and
heavy if it rises beyond a settable high threshold. The set-
ting of low and high threshold used here is 2 and 3, re-
spectively.
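The load measurement and the two thresholds can be sketched as follows. The function names are hypothetical, and the sketch assumes that memory-in-use is a fraction in the range [0, 1), which the text does not state explicitly; the defaults are the values quoted above (memprs = 0.01, thresholds 2 and 3).

```python
def internal_load(num_tasks, memory_in_use, memprs=0.01):
    """number-of-tasks + memprs / (1 - memory-in-use).

    memory_in_use is assumed to be a fraction in [0, 1).
    """
    return num_tasks + memprs / (1.0 - memory_in_use)


def load_state(load, low=2.0, high=3.0):
    """Classify a load value as light, moderate, or heavy."""
    if load < low:
        return "light"
    if load > high:
        return "heavy"
    return "moderate"


for tasks in (1, 2, 3):
    load = internal_load(tasks, memory_in_use=0.5)
    print(tasks, round(load, 2), load_state(load))
```

With half the memory in use the memory term adds only 0.02, so the state is driven almost entirely by the task backlog. The term grows sharply as memory-in-use approaches 1.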
The default size of an APPLY packet is 20 bytes. This
is sufficient to carry the characterizing information of
code-block pointer, argument location, and result loca-
tion. (Longer arguments and results are handled as de-
scriptors to structures.) A DATA packet has 10 bytes and
a pressure update packet has only 1 byte. The function
code of an APPLY packet may be disseminated to all pro-
cessors before the execution starts. Otherwise, the code
may be transferred along with the APPLY packet and
cached in the destination node. Communication delay
between two switches is another adjustable parameter.
Initially, the communication channel speed was set to 10
Mbits per second for each channel.
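As a rough worked example of what these sizes imply, the serialization time of each packet type on one 10 Mbit/s channel can be estimated. The calculation assumes 8-bit bytes and no framing or protocol overhead, and it ignores switch and propagation delays, none of which are specified in the text.

```latex
t = \frac{8 \times \text{bytes}}{10 \times 10^{6}\ \text{bit/s}}
\quad\Rightarrow\quad
t_{\text{APPLY}} = \frac{160}{10^{7}}\ \text{s} = 16\ \mu\text{s},\qquad
t_{\text{DATA}} = \frac{80}{10^{7}}\ \text{s} = 8\ \mu\text{s},\qquad
t_{\text{pressure}} = \frac{8}{10^{7}}\ \text{s} = 0.8\ \mu\text{s}
```

Under these assumptions a pressure update occupies a channel for one twentieth of the time taken by an APPLY packet.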
B. Simulation Results
The performance of the Rediflow architecture is eval-
uated using an introspective model. The speedups of a
multiprocessor system are measured against a single pro-
cessor with the same technological assumptions, architec-
ture, and evaluation model.
The simulation is event-driven. Messages sent between
switches are serviced in time-stamp order. Because we are
simulating mostly determinate programs with an invariable
number of noncommunication operations, each of
a similar duration, speedup can be computed on the
fly as
speedup = total time of operations / simulation time
which is equivalent to the sequential execution time
divided by the parallel execution time.
As an example, we use a highly parallel test program
which is a purified divide and conquer algorithm, sum-
ming the leaves of a binary tree with nodes numbered 1-
1024. With the syntax of the functional language FEL,
the program DC1024 is shown as follows.
{
result DC[1,1024]
DC[m,n] =
    { result if m >= n
             then m
             else DC[m,med] + DC[med+1,n]
      med = (m+n) div 2
    }
}
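For readers unfamiliar with FEL, the same recursion can be written in Python. This is only a sequential transliteration for illustration (the function name `dc` is hypothetical); it says nothing about how Rediflow spawns the two recursive calls as separate tasks.

```python
def dc(m, n):
    """Sum the integers m..n by splitting the range in half, as DC[m,n] does."""
    if m >= n:
        return m
    med = (m + n) // 2
    # In the applicative system these two calls are independent tasks.
    return dc(m, med) + dc(med + 1, n)


print(dc(1, 1024))  # 524800
```

The recursion depth is only 10 for 1024 leaves, while the number of independent subproblems doubles at each level, which is what makes the program a convenient source of parallel work.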
The program DC1024 is run on the Rediflow simulator
with an increasing number of Xputers. Given a fixed num-
ber of Xputers, DC1024 is exercised on several configu-
rations. Different topologies used in the simulation are de-
picted in Fig. 3. The speedup of the simulation versus the
size and topology of the system is shown in Fig. 4.
It is no surprise that wrapped topology performs better
than the nonwrapped configuration, since the average dis-
tance between any two Xputers in the former is only about
half that in the latter. Both the task packets and pressure
updates benefit from the shorter communication distance.
The hypercube configuration appears to be the most efficient
topology of these alternatives. This is to be expected,
since for a given number of nodes, this configuration
has the smallest diameter. The smaller the diameter,
the faster saturation can be detected. However, the
simple wraparound grid configuration performs reasonably
well. This is encouraging, because the grid is a logical
choice for wafer scale VLSI implementations.
Fig. 3. Topologies used in simulation: GRID, WRAP, CUBE, HYPERCUBE.
Fig. 4. Speedup of DC1024 versus number of processors for the grid, wrap, cube, and hypercube topologies.
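The diameter and mean-distance argument can be explored with a short breadth-first search. The sketch below is illustrative only: all function names are hypothetical, the 64-node size is chosen for the example rather than taken from the simulations, and the CUBE topology is omitted because its exact layout is not described in the text.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def diameter_and_mean(adj):
    """Return (diameter, mean distance over ordered pairs of distinct nodes)."""
    total, pairs, diameter = 0, 0, 0
    for source in adj:
        dist = bfs_distances(adj, source)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1
    return diameter, total / pairs


def grid(rows, cols, wrap=False):
    """Rectangular grid; wrap=True joins opposite edges (wraparound grid)."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbs = set()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if wrap:
                    rr, cc = rr % rows, cc % cols
                if 0 <= rr < rows and 0 <= cc < cols and (rr, cc) != (r, c):
                    nbs.add((rr, cc))
            adj[(r, c)] = sorted(nbs)
    return adj


def hypercube(dim):
    """Binary hypercube: neighbours differ in exactly one address bit."""
    return {node: [node ^ (1 << bit) for bit in range(dim)] for node in range(1 << dim)}


for name, adj in (("grid", grid(8, 8)), ("wrap", grid(8, 8, wrap=True)), ("hypercube", hypercube(6))):
    print(name, diameter_and_mean(adj))
```

For 64 nodes the diameters come out as 14, 8, and 6 respectively, so $w_{\max}$, and with it the number of hops a pressure change may need to travel, is smallest for the hypercube.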
1) Idealized Balancing: To assess the effectiveness of
a load balancing scheme, one needs to identify an ideal
balancing method and compare it with the given method.
A shared-memory model with a centralized scheduling
queue would seem to provide the ultimate in load balanc-
ing. It must be assumed that communication cost is neg-
ligible. An underutilized processor requests further work
load from this central facility. For comparison to the load-
balancing rule under various configurations, the idealized
case is shown in Fig. 4 as the upper dashed curve.
2) Trajectory: Periodic snapshots of the Xputer utili-
zations shown in Fig. 5 are compiled from DC1024 run-
ning with a 4 x 4 wrapped grid configuration. The time
interval between samples is 5000 simulation time units.
The processor and switch utilizations are shown as inter-
val averages. The switch utilization curve also depicts the
breakdown of switch traffic between user data packets and
system pressure updating packets. It shows that the updating
packets use only a fraction of communication
bandwidth in this example.
Fig. 5. Trajectory of processor and switch utilizations.
Fig. 6. Utilization of different communication bandwidths (horizontal axis: communication bandwidth per channel, MHz).
Processor utilization rises fairly rapidly during early
stages of the simulation cycle. It indicates the effective-
ness of spreading work load over idle Xputers. The slope
of the rising curve reflects the eagerness of load spread-
ing. This rate is controllable by adjusting the high and low
load thresholds. Note that the processor utilization dips
when the simulation time is around 70 000 time units. The
reason is that the system is engaged in a garbage collec-
tion operation where memory utilization changes signifi-
cantly.
3) Communication Speed: The results from the above
simulations show that the switch utilization is well within
our technology assumption, which assumes each com-
munication channel has a 10 Mbit per second data transfer
rate. Since communication overhead is a critical gauge in
any multiprocessing system, this section examines the ef-
fect of communication bandwidth by changing the speed
settings in the Rediflow simulator. The test program is
still DC1024 with a wrapped 4 x 4 grid configuration. Fig.
6 shows the impact of communication bandwidth on pro-
cessor and switch utilizations. The speedup of the system
is depicted in Fig. 7.
Slow communication channels, e.g., 1 MHz, significantly
impede the system throughput, since the processors
are only utilized half of the time. The speedup increases
as the communication bandwidth improves. However,
when the channel capacity exceeds the 10 MHz rate, the
system starts to approach the throughput upper bound.
This simulation shows that a communication channel
speed of 10 MHz seems adequate for this combination of
processors, switches, and task granularities.
Fig. 7. Speedups of different communication bandwidths (horizontal axis: communication bandwidth per channel, MHz).
Similar, although less extensive, simulation studies
have been conducted on more practical examples, such as
matrix multiplication using quad-trees, histogramming for
image processing, logic programming, and n-queens
search, etc. The DC1024 example used here seems to be
representative, in terms of the amount of usable concur-
rency present, and the resulting system performance.
V. SUMMARY
The load balancing problem is crucial in multiprocessor
systems having large numbers of processors and which
spawn many concurrent tasks. Any balancing scheme re-
quiring a centralized action seems impractical when the
system scales up. Applications with spontaneous task
generation also make it difficult to prenegotiate a balanced
distribution.
In this study, a distributed load balancing scheme,
called the gradient model, is devised. The model is based
on a demand-driven principle which requires the under-
utilized processors to dynamically initiate load balancing
requests. A system-wide gradient surface is formed as a
result of these requests. Overloaded processors respond
to requests by migrating unevaluated tasks down the gra-
dient surface toward under-utilized processors.
A global balance state is achieved computationally by
successive approximation of many localized balances. The
concept of saturation is introduced to discourage futile
load migration when the system is fully utilized.
The Rediflow simulator, which simulates a proposed
applicative system, incorporates the gradient model load
balancing mechanism. Various architectural tradeoffs have
been studied with the simulation. Simulation studies sug-
gest that the gradient model performs satisfactorily under
reasonable technological assumptions.
REFERENCES
[1] F. W. Burton and M. R. Sleep, "Executing functional programs on
a virtual tree of processors," in Proc. 1981 Conf. Functional Pro-
gramming Languages and Computer Architecture, Oct. 1981, pp.
187-194.
[2] Y. Chow and W. H. Kohler, "Models for dynamic load balancing in
a heterogeneous multiple processor system," IEEE Trans. Comput.,
vol. C-28, pp. 354-361, May 1979.
[3] A. L. Davis and R. M. Keller, "Dataflow program graphs," Com-
puter, vol. 15, no. 2, pp. 26-41, Feb. 1982.
[4] J. B. Dennis and D. P. Misunas, "A preliminary architecture for a
basic data-flow processor," in Proc. 2nd Annu. Symp. Comput. Ar-
chitecture, IEEE, 1974.
[5] K. Efe, "Heuristic models of task assignment scheduling in distrib-
uted systems," Computer, vol. 15, no. 6, pp. 50-56, June 1982.
[6] K. P. Gostelow and R. E. Thomas, "Performance of a simulated da-
taflow computer," IEEE Trans. Comput., vol. C-29, pp. 905-919,
Oct. 1980.
[7] K. Hwang et al., "A UNIX-based local computer network with load
balancing," Computer, vol. 15, pp. 55-64, Apr. 1982.
[8] R. M. Keller, G. Lindstrom, and S. Patil, "A loosely-coupled applicative
multi-processing system," in Proc. AFIPS, June 1979, pp. 613-622.
[9] R. M. Keller, "Function-equation language programmer's guide,"
Dep. Comput. Sci., Univ. Utah, AMPS Tech. Memo. 7, Apr. 1982.
[10] R. M. Keller, F. C. H. Lin, and J. Tanaka, "Rediflow multiprocess-
ing," in Proc. CompCon '84, IEEE, Feb. 1984, pp. 410-417.
[11] R. M. Keller and F. C. H. Lin, "Simulated performance of a reduc-
tion-based multiprocessor," Computer, vol. 17, no. 7, pp. 70-82,
July 1984.
[12] R. M. Keller, "Rediflow architecture prospectus,"
Dep. Comput. Sci., Univ. Utah, Tech. Rep. UUCS-85-105, Aug.
1985.
[13] F. C. H. Lin and R. M. Keller, "Distributed recovery in applicative
systems," in Proc. Int. Conf. Parallel Processing, Aug. 1986.
[14] A. Kratzer and D. Hammerstrom, "A study of load levelling," in
Proc. CompCon, IEEE, Fall 1980, pp. 647-654.
[15] L. M. Ni, "A distributed load balancing algorithm for point-to-point
local computer networks," in Proc. CompCon, IEEE, Fall 1982, pp.
116-123.
[16] L. M. Ni, C. W. Xu, and T. B. Gendreau, "Drafting algorithm-A
dynamic process migration protocol for distributed systems," in Proc.
5th Int. Conf. Distributed Comput. Syst., IEEE, Denver, CO, May
1985, pp. 539-546.
[17] D. D. Sharp and P. L. Crews, "Work distribution in a bus-structured
fully distributed processing system," in Proc. CompCon, IEEE, Sept.
1983, pp. 42-49.
[18] J. A. Stankovic and I. S. Sidhu, "An adaptive bidding algorithm for
processes, clusters and distributed groups," in Proc. 4th Int. Conf.
Distributed Comput. Syst., IEEE, San Francisco, CA, May 1984, pp.
49-59.
[19] J. A. Stankovic, "A perspective on distributed computer systems,"
IEEE Trans. Comput., vol. C-33, pp. 1102-1115, Dec. 1984.
[20] S. R. Vegdahl, "A survey of proposed architectures for the execution
of functional languages," IEEE Trans. Comput., vol. C-33, pp. 1050-
1071, Dec. 1984.
Frank C. H. Lin received the B.S.E.E. degree
from the National Taiwan University in 1973, the
M.S.E.E. degree from Utah State University, Logan,
in 1978, and the Ph.D. degree in computer
science from the University of Utah, Salt Lake
City, in 1985.
He is a Program Manager with ESL Inc., a
subsidiary of TRW, in California. He was with
CalComp Electronics Inc. from 1975 to 1976.
From 1978 to 1986, he was with Sperry Corpo-
ration in Salt Lake City where he last served as
manager of the Research and Advanced Technology Group. His current
research interests are parallel architectures, heterogeneous systems, data
flow machines, and fault-tolerant computing.
Robert M. Keller (S'64-M'66) received the B.S.
and M.S.E.E. degrees from Washington Univer-
sity, St. Louis, MO, and the Ph.D. degree from
the University of California, Berkeley, all in elec-
trical engineering and computer science.
From 1970 to 1976 he was an Assistant Pro-
fessor of Electrical Engineering at Princeton Uni-
versity. From 1976 to 1986 he was Associate Pro-
fessor, then Professor, of Computer Science at the
University of Utah, Salt Lake City. He has held
visiting appointments at Stanford University and
Lawrence Livermore National Laboratories. He is currently Director of
Research at Quintus Computer Systems, Inc. in Mountain View, CA. His
research contributions are in the area of theory of concurrent processing,
parallel program verification, parallel computer architecture, and imple-
mentation of functional languages. His current research interests deal with
numerous topics relating to logic programming and multiprocessing.
