Method of up-front load balancing for local memory parallel processors by Baffes, Paul Thomas
United States Patent [19]
Baffes
[11] Patent Number: 4,920,487
[45] Date of Patent: Apr. 24, 1990
[54] METHOD OF UP-FRONT LOAD 
BALANCING FOR LOCAL MEMORY 
PARALLEL PROCESSORS 
[75] Inventor: Paul T. Baffes, Houston, Tex. 
[73] Assignee: The United States of America as 
represented by the Administrator of 
the National Aeronautics and Space 
Administration, Washington, D.C. 
[21] Appl. No.: 283,106 
[22] Filed: Dec. 12, 1988 
[51] Int. Cl.5 ......................... G06F 9/46, G06F 15/16
[52] U.S. Cl. ................................. 364/300; 364/228.3;
364/231.9; 364/280; 364/281
[58] Field of Search ... 364/200 MS File, 900 MS File,
364/300
[56] References Cited
U.S. PATENT DOCUMENTS 
4,400,768 8/1983 Tomlinson .......................... 364/200 
4,410,944 10/1983 Kronies ............................. 364/200 
4,468,736 8/1984 DeSantis et al. .................... 364/200 
4,491,932 1/1985 Ruhman et al. .................... 364/900 
4,495,570 1/1985 Kitajima et al. .................... 364/200 
4,590,555 5/1985 Bourrez ............................. 364/200 
4,633,387 12/1985 Hartung et al. ..................... 364/200 
OTHER PUBLICATIONS 
“Design of a Neural Network Simulator on a Transputer Array”, by Gary McIntire, James Villarreal, Paul Baffes, & Monica Rua, presented at Space Operations-Automation and Robotics Workshop 87, NASA/Johnson Space Center, Houston, TX, 8/5-7/87.
“Performance Tradeoffs in Static and Dynamic Load Balancing Strategies,” by M. Ashraf Iqbal, Joel H. Saltz and Shahid H. Bokhari, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, Va. 23665, Mar. 1986.
Primary Examiner-Raulfe B. Zache 
Attorney, Agent, or Firm-Hardie R. Barr; John R. 
Manning; Edward K. Fein 
[57] ABSTRACT
In a parallel processing computer system with multiple 
processing units and shared memory, a method is dis- 
closed for uniformly balancing the aggregate computa- 
tional load in, and utilizing a minimal memory by, a 
network having identical computations to be executed 
at each connection therein. Read-only and read-write 
memory are subdivided into a plurality of partitions, 
and the computational load is subdivided into a plurality 
of process sets, which function like artificial processing 
units. Said plurality of process sets is iteratively merged 
and reduced to the number of processing units without 
exceeding the balance load. Merger is based upon the 
value of a partition threshold, which is a measure of the 
memory utilization. The turnaround time and memory 
savings of the instant method are functions of the num- 
ber of processing units available and the number of 
partitions into which memory is subdivided. 
8 Claims, 2 Drawing Sheets 
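The abstract's "balance load" is defined later in the specification as the quotient of the aggregate load and the number of processing units, rounded up to the next integer. The following is a minimal sketch of that arithmetic only, not of the merging method itself. The function names and the sample loads are hypothetical and do not come from the patent.

```python
def balance_load(connection_loads, num_units):
    """Aggregate load divided by the number of processing units, rounded up."""
    total = sum(connection_loads)
    # Ceiling division on integers avoids floating-point rounding.
    return -(-total // num_units)


def respects_balance(process_set_loads, num_units, cap):
    """True when there are no more sets than units and no set exceeds the cap."""
    return len(process_set_loads) <= num_units and all(
        load <= cap for load in process_set_loads
    )


# Hypothetical per-connection loads (total 24) shared by 4 processing units.
loads = [3, 5, 2, 7, 4, 3]
cap = balance_load(loads, 4)
ok = respects_balance([6, 6, 6, 6], 4, cap)
```

Under these assumptions, a merged result is acceptable once the number of process sets has dropped to the number of processing units and no set carries more than the balance load.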
https://ntrs.nasa.gov/search.jsp?R=19910005456 2020-03-24T06:42:38+00:00Z
U.S. Patent    Apr. 24, 1990    Sheet 1 of 2    4,920,487
[Drawing sheet 1: flowchart, not reproduced]
U.S. Patent    Apr. 24, 1990    Sheet 2 of 2    4,920,487
[FIG. 3: flowchart, not reproduced. Legible box labels: ORDER PROCESS SETS BY LOAD; CALCULATE MERGE POSSIBILITIES; INCREMENT LOAD OF PROCESS SET, DELETE MERGED SET(S) FROM LIST OF PROCESS SETS; REORDER LIST OF PROCESS SETS; RAISE PARTITION THRESHOLD BY ONE OR MORE]
METHOD OF UP-FRONT LOAD BALANCING FOR 
LOCAL MEMORY PARALLEL PROCESSORS 
ORIGIN OF INVENTION 
The invention described herein was made by an em- 
ployee of the United States Government and may be 
manufactured and used by or for the Government of the 
United States of America for governmental purposes 
without payment of any royalties thereon or therefor. 
FIELD OF THE INVENTION 
This invention relates to a method for distributing 
tasks among a plurality of processors, and more particu- 
larly relates to a method of uniformly distributing tasks 
among parallel processors whereby minimal memory is 
utilized by each processor. 
BACKGROUND OF INVENTION 
It is well known that most computer systems in current use consist of a single processor with concomitant memory and peripheral devices. Recently, however, multicomputer environments, consisting of the interconnection of multiple processors, have become available. In such environments, the computational tasks or loads are accomplished by distributing them across the available plurality of processors.
It is further known in the prior art that the preferable multicomputer operating environment is one in which parallel processing is performed. Generally, computer systems with parallel processors either have shared memory or dedicated memory. In shared memory computer systems, all of the available memory is shared among all of the parallel processors. Thus, the available memory is not associated with any individual processor but is a resource associated with the entire computer system. On the other hand, in a dedicated memory computer system, the available memory is allocated to each individual processor. Each quantum of memory allocated to a processor is for that processor’s exclusive use. No sharing between processors occurs.
Regardless of whether the memory in a parallel computer system is shared or dedicated, for a particular process to be accommodated under this environment, its panoply of computational tasks must be subdivided into a set of parallel components. As is known to those skilled in the art, parallel components may be executed separately and independently of other parallel components. But as is further known to those skilled in this art, the subdivision of a process into parallel components is often a difficult task in itself. As an illustration, in U.S. Pat. No. 4,468,736, DeSantis, et al., disclose a method for decomposing a process into independent, disjoint tasks for parallel processing. Once the parallel components of a process have been established, it will become clear that they must be distributed among the processors of a multicomputer system to effectuate acceptable throughput.
The distribution or “balancing” of a multicomputer’s load among its constituent processors may be referred to as “load balancing.” Conventional load balancing methodologies have sought to allocate the various loads assigned to a multicomputer system by exploiting the architecture of a particular computer hardware configuration. This machine-dependency arises because the optimal distribution of tasks in a multicomputer environment may be achieved only by enumeration of all possible task configurations. Such enumeration is prerequisite to achieving the optimal balancing of tasks because the distribution of parallel tasks among a plurality of processors, like the traveling salesman and graph-partitioning problems, has been shown to be a member of the class of nondeterministic polynomial-time complete (NP-complete) problems. It is known to those skilled in the art that such NP-complete problems are intractable and defy analytical solution, as discussed by O.I. El-Dessouki and W.H. Huen in the IEEE Trans., vol. C-29, no. 9, September 1980, pp. 818-825, in their article entitled “Distributed Enumeration on Network Computers.”
In a parallel multiprocessor environment, the objective of load balancing is to distribute computational loads among these processors whereby each processor executes equivalent loads. Indeed, the more uniformly tasks are distributed among the processors, the more effectively the multicomputer system is executed because the processors are more likely to be actively performing computational tasks. This balancing is generally performed either statically or dynamically.
Static load balancing is conventionally used when the parallel computational components of a process can be completely ascertained prior to their execution. Dynamic load balancing is usually used when the attributes of the parallel computational components of a process vary over time, or when none of these attributes can be ascertained prior to execution.
For a multicomputer system with many tasks, the enumeration method of distributing tasks is clearly impractical and unmanageable. Accordingly, it is well known in the prior art that heuristic methods may be used to achieve a reasonable, albeit suboptimal, distribution of tasks as herein discussed. It is apparent in the prior art that to achieve optimal load balancing in a parallel processing environment requires a formidable expenditure of processing time. It is conventional to avoid these rigorous constraints by heuristically ascertaining a suboptimal load balance. Such a heuristic determination is achieved at a mere fraction of the system resource and without the hereinbefore mentioned information about the composition of the process load mix.
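The contrast between enumeration and a heuristic can be illustrated with a short sketch. It counts the task-to-processor assignments that exhaustive enumeration would have to examine, then applies a common greedy rule (largest task first, always to the least-loaded processor). This greedy rule is a generic textbook heuristic chosen for illustration; it is not the method of this patent, and the function names and sample loads are hypothetical.

```python
import heapq


def enumeration_size(num_tasks, num_processors):
    """Number of distinct assignments of tasks to processors."""
    return num_processors ** num_tasks


def greedy_assign(task_loads, num_processors):
    """Assign each task, largest first, to the currently least-loaded processor."""
    heap = [(0, p) for p in range(num_processors)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_processors)]
    ordered = sorted(enumerate(task_loads), key=lambda item: -item[1])
    for task, load in ordered:
        total, p = heapq.heappop(heap)
        assignment[p].append(task)
        heapq.heappush(heap, (total + load, p))
    return assignment


task_loads = [7, 5, 4, 3, 1]  # hypothetical task sizes
plan = greedy_assign(task_loads, 2)
per_processor = [sum(task_loads[t] for t in tasks) for tasks in plan]
```

Twenty tasks on four processors already give 4**20 (about 1.1 trillion) candidate assignments, while the greedy pass needs only a sort and one heap operation per task. The greedy result is not guaranteed to be optimal in general.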
One such heuristic method known in the prior art is 
called “pipelining.” This method is applicable to pro- 
cesses which can be subdivided into parallel processes 
which need minimal amounts of data. When the first 
available processor requests a load, a process and its 
concomitant data is pipelined thereto. As is known to 
those skilled in the art, this method is useful only if the 
computational time is longer than the time expended 
initiating the computation and communicating its re- 
sults. It will become clear that if the contrary occurs, 
the processors tend to remain idle because too much 
time is expended on information flow. 
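The condition stated above can be written as simple arithmetic. The sketch below assumes, for illustration only, that start-up and communication time are not overlapped with computation; the function and parameter names are hypothetical.

```python
def pipelining_worthwhile(compute_time, startup_time, transfer_time):
    """True when computation outlasts the time spent starting it and moving its data."""
    return compute_time > startup_time + transfer_time


def busy_fraction(compute_time, startup_time, transfer_time):
    """Share of a processor's time spent computing rather than waiting on overhead."""
    return compute_time / (compute_time + startup_time + transfer_time)


useful = pipelining_worthwhile(10, 1, 2)
wasteful = pipelining_worthwhile(1, 1, 2)
fraction = busy_fraction(6, 1, 1)
```

When the overhead terms dominate, the busy fraction falls toward zero, which is the idle-processor situation the paragraph describes.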
Another method known in the prior art is called 
“vectorizing.” This method is applicable to independent 
processes for which identical computations are per- 
formed. Multiple identical computations are performed 
on large arrays during each iteration, and each such 
iteration is uniformly distributed among the available 
processors. 
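As a rough illustration of distributing identical computations uniformly, the sketch below splits an array into near-equal contiguous chunks and applies the same function to each chunk. Worker threads stand in for processors here purely for demonstration (in CPython they do not provide true CPU parallelism), and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(n_items, n_workers):
    """Return (start, stop) index pairs whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    bounds, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def apply_in_chunks(values, func, n_workers):
    """Apply the same computation to every element, one chunk per worker."""
    def work(bound):
        lo, hi = bound
        return [func(v) for v in values[lo:hi]]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map returns results in the order of the submitted chunks.
        parts = list(pool.map(work, split_evenly(len(values), n_workers)))
    return [y for part in parts for y in part]


squares = apply_in_chunks(list(range(10)), lambda v: v * v, 3)
```

Because every element costs the same, equal-sized chunks give each worker an equal share of the work.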
Several methods and systems have been developed to 
improve the load balancing art. For example, Hartung, 
et al., in U.S. Pat. No. 4,633,387 teach a method of 
dynamic load balancing whereby work queues in a 
shared memory environment are examined to ascertain 
whether work-requesting thresholds have been met. 
Similarly, Ruhman, et al., in U.S. Pat. No. 4,491,932 disclose a method to partition shared memory for distributing the loads of disjoint processes into a reconfigurable array. In U.S. Pat. No. 4,495,570, Kitajima, et al. disclose a method for dynamically distributing the loads in a dedicated memory parallel processing environment whereby a processing request allocator executes service requests based upon process waiting and delay times.
Typically, the load balancing methodology used must handle an arbitrary set of tasks. That is, no a priori information about the number or size of the tasks is known. However, in applications where task information is known a priori, special methodologies incorporating the task sizes and respective interdependencies therein may be developed. An example of such an application might be a mail carrier who is assigned a maximum amount of letters and packages to deliver in a predefined geographical area. A similar example might be the delivery of packages by Federal Express wherein each truck is allocated a maximum number of “loads” which are delivered to predefined locations. Another application might be the mapping of billions of stars in a galaxy whereby each connection between the stars exhibits an identical operation.
Another example of a set of tasks whose sizes and interdependencies are known is a simulated neural network. Such a network consists of multiple, similar processes, whereby nodes, called neurons, are systematically interconnected via synapses. The neurons may be subdivided into groups which it will be seen execute in parallel. For a typical neural network, consisting of hundreds of neurons and thousands of connections, it has been difficult to effectively distribute the processing loads absent using the costly and time-consuming enumeration method.
In such a neural network where the processing at each node is identical, the prior art has been faced with two problems. The first problem is how to effectively deal with the large memory requirements of the network typically represented as arrays. The objective is for the processing units to perform the requisite calculations while utilizing minimal memory. The second problem is how to efficiently execute the myriad identical computations throughout the network. Since each node performs an action related solely to itself and to its interconnecting nodes, one solution might be to allocate each node to a processor in a multicomputer environment. Each of these processors would execute the computations for one node in parallel with the computations executed by the other processors. It is apparent that this solution is impractical because multicomputer systems typically do not consist of hundreds of processors.
It is well known in the prior art that the typical multicomputer system consists of from four to one hundred processors. Accordingly, to efficiently process a neural network requires a method of grouping the myriad computations into subsets which can be distributed among the available parallel processors. The paper “Design of a Neural Network Simulator on a Transputer Array” by Gary McIntire, et al., presented at the Space Operations-Automation and Robotics Workshop at NASA/Johnson Space Center on Aug. 5-7, 1987, elucidates the nature of the problem and subsetting strategies.
As has been hereinbefore discussed, those skilled in the prior art are familiar with various static and dynamic methods which have attempted to distribute loads among parallel processors. For instance, the paper “Performance Tradeoffs in Static and Dynamic Load Balancing Strategies” by Ashraf Iqbal, Joel H. Saltz and Shahid H. Bokhari, under NASA contracts NAS1-17070 and NAS1-18107, describes the limitations of various static and dynamic load balancing methods. None of the methods referenced therein, however, has sought to accomplish such distribution concomitant with the utilization of minimal memory.
SUMMARY OF INVENTION
The present invention provides a method to uniformly distribute the computational load of an artificial neural network, and the like, among the processing units of a multicomputer system while utilizing minimal memory.
The present invention subdivides the memory of a multicomputer system into a plurality of partitions, with each partition containing either read-only or read-write memory. The memory contained in a partition is not shared with any other partition in the computer system.
During the execution of each identical computation in the network, in which a finite number of machine cycles are executed, a process is performed which operates upon particular partitions of memory. The present invention collects these processes which operate upon the same regions of memory into packets.
In the preferred embodiment of the present invention, this memory-partitioning is represented by a two-dimensional array or grid with read-only memory positioned along one axis and read-write memory positioned along the other axis. Thus, each of the read-only and read-write memory is subdivided into partitions, depicted by rectangular regions in the array. Each such region represents one read-only partition and one read-write partition.
The preferred embodiment initially searches the process required for each connection of the network and enters the computational load for each such connection into the appropriate region of the array. After all of the processes have been entered into regions of the array, the corresponding packets are collected into process sets which function like artificial processing units. Thus, each process set is initially allocated to a particular memory partition.
It is a feature of the present invention that these process sets are used to achieve a uniform distribution of the load whereby each processing unit receives an equally balanced load. This balance load corresponds to the quotient of the aggregate load of the network and the number of processing units.
It is an advantage of the present invention that each process set is guaranteed to be allocated at most the balance load. The initial number of process sets is usually substantially greater than the number of processing units; therefore it is necessary to repeatedly merge process sets until the number of process sets and the number of processing units become equal.
The purpose of the merging of process sets is not only to reduce the number of process sets to the number of processing units, but also to combine the process sets whereby the regions of memory involved in each merger are in close proximity to each other. Since the regions of memory are represented by an N×N array, it is a feature of the present invention that the merger is limited by the current partition threshold.
The partition threshold represents the acceptable proximity between regions in the array, for merger
purposes. Thus, merging proceeds iteratively by combining process sets without exceeding the balance load and by attempting to keep the partition threshold within a value prescribed for the current iteration. As will become apparent, each iteration starts with a prescribed partition threshold, and typically terminates with a higher value of the threshold.
This higher value of the partition threshold corresponds to a dynamic adjustment of the allowable merge proximity limits. If no merges or an insufficient number of merges occur at the current threshold value, the present invention increments the size of the partition threshold by one and then restarts the iteration. As will become apparent, each such iteration starts with the lowest partition threshold of 2, and is repeated with a successively higher threshold until the best threshold for the network has been found.
The lowest partition threshold which allocates the balance load among the process sets or, in actuality, the processing units, will correspond to load balancing requiring minimal memory. This improved distribution of load or load balancing is performed, in accordance with the present invention, once and only once, after which the multiple processes throughout the network may be independently executed exclusively by the processor assigned to them.
It is a feature of the present invention that the desired merger of process sets is guaranteed because the balance load is interrelated with the number of processing units. Another important feature of the present invention is that the execution time to balance the load of an artificial neural network, and the like, is dependent only on the number of processing units contained in the computer system and the number of partitions into which memory is subdivided. Thus, this execution time is independent of the size of the network input.
Accordingly it is an object and feature of the present invention to provide a method to balance the load in an artificial neural network, and the like, wherein the computational load at each processing unit is uniformly distributed.
It is also an object of the present invention to balance the load in an artificial neural network, and the like, whereby minimal memory is utilized during the execution of the computational load apportioned to each processing unit.
It is a further object of the present invention to provide a method to guarantee a balanced load and minimal memory utilization in an artificial neural network, and the like.
It is a further object of the present invention to enable parallel multiprocessor computers to accommodate network configurations two to three times greater in size than would otherwise be accommodated.
It is still a further object of the present invention to enable network configurations to execute on parallel multiprocessor computers with only a fraction of the normal utilization of memory, particularly with only one third to one half of the memory normally utilized.
It is a specific object of the present invention to provide, in a parallel processing computer system including a plurality of processing units and shared memory, and containing a network having identical computations to be executed at each connection therein, and said network further having a constant aggregate computational load, a method of up-front load balancing comprising generating a first plurality of electrical signals with each such signal functionally related to a corresponding one of said identical computations in said network, generating a first electrical signal functionally related to the balance load in said network, generating a second plurality of electrical signals functionally establishing a preselected plurality of partitions of said memory, generating in response to said first and second pluralities of electrical signals, a first sequence of electrical signals functionally dividing said computational load into a plurality of process sets, generating in response to said first electrical signal and said first sequence of electrical signals, a second sequence of electrical signals functionally allocating said process sets among said memory partitions, and generating in response to said second sequence of electrical signals, a third sequence of electrical signals functionally merging said process sets until they are equal in number to said plurality of processing units.
These and other objects and features of the present invention will become apparent from the following detailed description, wherein reference is made to the figures in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating the operation of the main steps of the present invention.
FIG. 2 is a block diagram illustrating the operation of the steps of the present invention which divide the programs into process sets.
FIG. 3 is a block diagram illustrating the operation of the steps of the present invention which reduce the number of process sets to the number of processing units.
DETAILED DESCRIPTION
Referring now to FIG. 1, there may be seen the steps comprising the concept of the present invention. The programs, constituting the computational load of an artificial neural network, and the like, are arranged into process sets, at step 2. Each interconnection in such a network may be visualized as a from-to combination whereby information is sent from one node to another node. Thus, a from-to combination represents the flow of information between two nodes. Indeed, for information to flow through the network, all of the computational loads at each connection must be executed.
The present invention subdivides the memory of a multicomputer system into a plurality of partitions, with each partition containing either read-only or read-write memory. During the execution of a computation, instructions and source data are read from particular regions of “source memory.” Data changed as a result of a computation is written to particular regions of “target memory.” Thus, source memory is read but not changed, while target memory is changed.
Accordingly, in a preferred embodiment of the present invention, each memory partition is a disjoint subset of either the source or target memory. The memory contained in a partition is not shared with any other partition in the computer system. The total memory available to this system corresponds to the sum of all the source and target partitions which is equivalent to the sum of all source and target memory, which, of course, is in turn equivalent to the sum of all read-only and read-write memory.
During the execution of each identical computation in the said network, in which a finite number of machine cycles are executed, a process is performed whereby source memory and target memory are operated upon
as hereinbefore described. The present invention collects these processes, which operate upon the same regions of memory, into process sets of “packets” and merges them together to achieve the dual purpose of uniformly distributing the aggregate load and reducing memory utilization. This improved distribution of load or load balancing is performed, in accordance with the present invention, once and only once, after which the plurality of processes throughout the network may be independently executed exclusively by the processing unit assigned to them.
It is common for those skilled in the prior art to depict memory-partitioning by a two-dimensional array or grid. Accordingly, the partitioning of total memory, under the concept of the present invention, is represented by a two-dimensional array with source memory positioned along the vertical axis and target memory positioned along the horizontal axis. Thus, each of the source and target memory is subdivided into partitions, depicted by rectangular regions in the array. It will become apparent that each such region represents one source partition and one target partition.
At step 20, the number of process sets is reduced until the number of processing units and the number of process sets is equal. At step 40, the results of step 20 are saved for possible later use. Step 45 determines whether another cycle could produce a better result, namely uniformly distribute the load among the processing units with a lower memory partition threshold.
The partition threshold, in accordance with the concept of the present invention, is the indicia of the area in the memory partition array which may be assigned to each processing unit. Stated alternatively, the partition threshold represents the number of regions, and the proximity thereof to each other, for which each processing unit is responsible. Thus, a lower partition threshold is “better” than a higher threshold value because a lower value means that fewer regions of memory are prerequisite for the execution of the computations assigned to a particular processing unit.
Again referring to FIG. 1, if a lower partition threshold can be achieved by executing another cycle, step 50 resets the list of process sets to the results of step 2 and restarts the cycle at step 20. If a lower partition threshold cannot be achieved from executing another cycle, step 55 selects the lowest partition threshold from the results saved from step 40.
Referring now to FIG. 2, there may be seen, in greater detail, the methodology which arranges the computational load into process sets, depicted in FIG. 1 as step 2. The computational load is subdivided into component parallel processes at step 3. For a neural network, of course, the identical, parallel processes for each connection between neurons are known a priori. Accordingly, in the preferred embodiment of the present invention, these parallel processes are contained in an input file.
At step 4, the read-only and read-write memory are read from another input file and total memory is subdivided into N² regions, whereby a load array of N×N regions is created. While N may be selected arbitrarily, it has been empirically ascertained that an approximate starting value calculated as the nearest integer greater than the square root of twice the number of processing units is preferred. For example, in a computer system with forty processors, the “nearest” integer square root to twice forty, i.e., eighty, is nine, since nine is the square root of eighty one. Thus, N may either be calculated from the number of processing units or input as an overriding value.
Once the size of the array is established, at step 5, the computational load required at each connection in the network is read. Each load is placed into the load array based upon the connection’s read-only and read-write memory references. More particularly, an Integer-Distribute routine is invoked to determine the range of read-only or source memory encompassed by the corresponding source-indices of the load array. Similarly, the Integer-Distribute routine is then invoked again to determine the range of read-write or target memory encompassed by the corresponding target-indices of the load array. The steps comprising the Integer-Distribute routine are given in Table 1.
TABLE 1
Integer-Distribute Routine
Step  Description
a     TOTAL = total number of elements to be distributed
b     N = number of chunks into which TOTAL must be subdivided
c     L = list of chunks returned
d     Set BASE = TOTAL / N
e     Set LEFTOVER = TOTAL - (BASE X N)
f     Make list L, N units long, each unit with a SIZE of BASE number of elements
g     Determine if LEFTOVER is equal to zero
h     If LEFTOVER is equal to zero, return
i     If LEFTOVER is not equal to zero, find the first unit of list L with size of BASE
j     Increase the SIZE of that unit by one
k     Decrease LEFTOVER by one
l     Go to step g
If there is more input, the next computation is read and stored into the load array based upon its source and target memory references. The cumulative load is then incremented by the value of this load. Accordingly, all of the input is stored into the appropriate locations in the load array. When this load array initialization is completed, at step 6, the balance load is ascertained from the quotient of the cumulative load and the number of processing units. Since fractional balance load is not permitted, the balance load is rounded to the next higher integer value.
Still referring to FIG. 2, at step 7, the computational loads are grouped into process sets based upon the region of memory in which their packets are contained. First, an array for holding the memory partitions, a partition array, is created with the same N×N dimensions as the load array. Then for each region in the load array the load is tested for a zero value. If the value of the load for a particular region is zero, the next region in the load array is tested. If the value of the load for a particular region is nonzero, the packet is added to a list of process sets.
This value of the load is compared with the value of the balance load, at step 8. If the load is less than the balance load, the region of the partition array corresponding to the current region of the load array is set equal to the list of packets for the said region of the load array. If the load is not less than the balance load, this overloaded process set is deleted from the current list of process sets, at step 9, and then the Integer-Distribute routine is invoked to propagate packets of equal load obtained by taking the quotient of the particular load and the balance load.
Still at step 9, the newly created process sets are added to the current list of process sets. The process set
4,920,487 
9 10 
to which each packet belongs is recorded. Steps 8 and 9
are repeated for each region of the load array and a
corresponding entry is written into the partition array.
After all of the regions in the load array have been read,
at step 10, a copy of the partition array is saved for
resetting purposes.

Referring now to FIG. 3, there may be seen, in
greater detail, the methodology which reduces the
number of process sets to be equal to the number of
processing units, depicted in FIG. 1 as step 20. At step
21, the process sets are sorted by load and placed into a
list. The preferred embodiment of the present invention
uses an insert sort which is well known to those skilled
in the art. The insert sort is described in The Design and
Analysis of Computer Algorithms, written by A.V. Aho,
J.E. Hopcroft and J.D. Ullman, and published by
Addison-Wesley in 1974. It should be clear that a variety of
sort algorithms are known in the prior art and may be
used in the present invention.

Still referring to step 21, the initial partition threshold
is set to its lowest value of 2. Next, at step 22, the
number of process sets is compared with the number of
processing units. If the number of process sets is less
than or equal to the number of processing units, then the
results are saved in step 40 of FIG. 1. If, as usual, the
number of process sets is greater than the number of
processing units, then as shown in step 23, the process
set with the lowest load is found in the list of process
sets. Once found, this lowest load process set is deleted
from the list of process sets, and then subdivided into
multiple process sets, one for each packet. Each such
process set, which has a load equal to the load of its
packet, is off-loaded by attempting merging with the
other process sets in the said sorted list.

More particularly, in step 24, an attempt is made to
off-load each of these subdivided process sets, created
from the lowest load process set, by merging it with
the other process sets. Using the list of packets
contained in the partition array, a search is made for a
packet in the partition array which shares the same
source and target partitions as the process set to be
off-loaded, but belongs to a different process set. At step
25, it is determined if such a packet has been located in
the partition array. If there is such a packet, as much
load as possible is off-loaded without exceeding the
balance load, at step 26. At step 28, it is determined if
there is any residual load in the process set being
off-loaded.

If there is residual load in the process set being
off-loaded, then a set of merge candidates is sought. These
merge candidates are chosen from all of the other
process sets which have a load less than the balance load.
For each candidate, merger with the residual load is
attempted. For process sets which do not exceed the
balance load, the partition-sum of the potential merger
is calculated by adding one for every unique source and
target memory partition in the combined process set. If
any of the trial mergers exceed the partition threshold
for a process set, they are discarded.

The surviving set of merge candidates corresponds to
all of the process sets which can off-load all or part of
the residual load without exceeding the balance load.
From this surviving set, the merge candidate which has
the most partitions in common with the set to be
off-loaded is chosen. To promote this choice being made,
the set is sorted by ascending load and by ascending
difference in partitions. If there are no candidate sets
satisfying this criterion, then any combination of
candidate sets are chosen which collectively can off-load the
residual load, provided that the said collection of sets
have the most partitions in common with the residual
process set. The steps comprising the Combine routine
are given in Table 2.

TABLE 2
Combine Routine
Step  Description
a     A = set to off-load via merger
b     B = set to be combined with A
c     Delete B from list of process sets
d     Set “owner” of packet of A = B
e     Add packets of A to list of packets of B
f     Resort B by load into list of process sets

If, however, there is no combination of candidate
process sets which can effect the off-loading without
exceeding the partition threshold, the threshold is
increased by one, at step 29, and the merge procedure,
starting at step 24, is repeated.

As is apparent to those skilled in the art, once the
process sets participating in this merger operation have
been identified, they are deleted from the list of process
sets, at step 28. At step 27, the remaining list of process
sets is again sorted by load using an insert sort as
hereinbefore discussed at step 21. Hence, only the new process
sets are actually sorted.
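The candidate filtering and the Combine routine described above can be illustrated with a short Python sketch. Everything named here is hypothetical: the `Packet` record, the representation of a process set as a list of packets, and the helpers `partition_sum`, `pick_candidate` and `combine` are assumptions made for illustration. In particular, the tie-break in `pick_candidate` (most partitions in common, then lowest load) is only one reading of the sort order given in the text, and partial off-loading and multi-set combinations are not modelled.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Packet(NamedTuple):
    source: int  # index of the source memory partition (assumed field)
    target: int  # index of the target memory partition (assumed field)
    load: int    # number of identical computations in the packet


ProcessSet = List[Packet]


def load_of(pset: ProcessSet) -> int:
    return sum(p.load for p in pset)


def partitions(pset: ProcessSet) -> Set[Tuple[str, int]]:
    # Source and target partitions are counted separately.
    return {("src", p.source) for p in pset} | {("tgt", p.target) for p in pset}


def partition_sum(a: ProcessSet, b: ProcessSet) -> int:
    # One for every unique source and target partition in the combined set.
    return len(partitions(a) | partitions(b))


def pick_candidate(residual: ProcessSet, others: List[ProcessSet],
                   balance_load: int, threshold: int) -> Optional[ProcessSet]:
    survivors = [c for c in others
                 if load_of(c) < balance_load                 # has spare capacity
                 and partition_sum(residual, c) <= threshold]  # within threshold
    if not survivors:
        return None  # caller would raise the threshold by one and retry
    # ASSUMED ordering: most shared partitions first, then lowest load.
    return min(survivors,
               key=lambda c: (-len(partitions(residual) & partitions(c)), load_of(c)))


def combine(a: ProcessSet, b: ProcessSet, process_sets: List[ProcessSet]) -> None:
    # Table 2, steps c to f: delete B, give B the packets of A, re-sort by load.
    process_sets[:] = [s for s in process_sets if s is not b]
    b.extend(a)
    process_sets.append(b)
    process_sets.sort(key=load_of)
```

A return value of `None` from `pick_candidate` corresponds to the case in which the partition threshold must be increased at step 29.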
Referring again to FIG. 1, at step 45, for each iteration
which commences with a particular partition
threshold and ends with another, higher partition 
threshold, another iteration is performed with the start- 
ing partition threshold incremented by one. This cycle 
continues until it is clear that another attempt cannot 
yield a lower ending threshold. At step 55, the best 
answer is obtained from the saved results in step 40. 
It should be clear to those skilled in the art that the 
present invention finds an approximate solution to a 
difficult and formerly intractable problem. 
It will be shown that the maximum execution time for 
the present invention is linearly related to the size of the 
input for the initialization phase in which the load and 
partition arrays are established. It will further be shown 
that an unexpected advantage of the preferred embodi- 
ment is that the execution time for the remainder of the 
method taught by the present invention is independent 
of the size of the input. 
As hereinbefore described in detail, the preferred 
embodiment of the present invention uses an iterative 
merging of process sets using a starting value of the 
partition threshold. The partition threshold is used as 
the indicia of the memory required for execution of 
each identical computation by the processing units. 
After each iteration, the list of process sets has been 
reduced and reset accordingly. The initial partition 
threshold is then incremented, and another iteration 
made. 
It should be clear that the partition threshold concept 
of the present invention functions as a guide for merg- 
ing process sets. More particularly, it guides which 
process sets should be merged with the process set 
being off-loaded, by focusing on the total partition area 
which would be in effect after the purported merger. 
The present invention has an inherent bias against the 
partition threshold being exceeded during the merger 
phase. If, however, no merges are possible for a particu- 
lar partition threshold, then the partition threshold must 
be incremented for merging to occur. 
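The outer iteration over starting partition thresholds, as described above, might be organized along the following lines. This is a sketch under stated assumptions: `run_reduction` is a hypothetical callback standing in for one complete reduction pass (it takes a starting threshold and returns the ending threshold), and the stopping test reflects one reading of "another attempt cannot yield a lower ending threshold", namely that an ending threshold can never be lower than its starting threshold.

```python
from typing import Callable, Optional, Tuple


def best_threshold(run_reduction: Callable[[int], int],
                   n_partitions: int) -> Optional[Tuple[int, int]]:
    """Return (starting threshold, ending threshold) of the best saved run."""
    best: Optional[Tuple[int, int]] = None
    start = 2  # lowest partition threshold: one source plus one target partition
    # ASSUMPTION: stop once the next start could not beat the best ending value.
    while start <= 2 * n_partitions and (best is None or start < best[1]):
        final = run_reduction(start)  # threshold may be raised during the pass
        if best is None or final < best[1]:
            best = (start, final)     # save the result of this iteration
        if final == start:
            break                     # no increment was needed; nothing better remains
        start += 1
    return best
```

The saved pair plays the role of the results stored at step 40, from which the best answer is taken at step 55.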
The lower the partition threshold, the lower the partition
area occupied and the more compact the resulting
combined process sets will be. However, a lower partition
threshold means that the merging phase is more
constrained. Thus, merges which occur prior to the
partition threshold being incremented may be passing
up merge opportunities which could produce a more
optimal overall reduction in memory. The goal is to
select an initial partition threshold which allows all
merges to occur without having to increment the
threshold value during the iteration.

Since the optimal partition threshold is not known a
priori, the preferred embodiment is executed multiple
times with progressively higher threshold values. The
iteration in which the final partition threshold and the
initial partition threshold are the same terminates
execution and yields the best load balancing solution. As
hereinbefore explained, this solution includes a uniform
distribution of the computational load among the
processing units, with minimal (albeit suboptimal) memory
being utilized by these processing units.

After the iteration in which the final partition threshold
and the initial partition threshold are the same is
reached, no subsequent iterations will produce an
environment which is less constrained. Since the partition
threshold was unchanged, all attempted merges were
effectuated during that iteration.

Of course, for iterations in which no merges or an
insufficient number of merges can occur, the present
invention increments the partition threshold by one and
then restarts the current iteration. This incremented
value of the partition threshold corresponds to a dynamic
adjustment of the allowable merge proximity
limit. Clearly, lower partition threshold values preclude
merges for memory regions containing packets which
are too far apart. On the other hand, higher partition
threshold values allow merges between process sets
which contain packets contained in regions which are
far apart, thereby utilizing too much memory.

To arrive at an estimate of the operating time of the
preferred embodiment, the reduction and resetting
times as well as the number of prerequisite iterations
must be determined. For the operation of the preferred
embodiment in a given application environment, the
number of processing units and the number of partitions
are constant.

As hereinbefore stated, the recommended value of
the number of partitions is approximately the square
root of twice the number of processing units. Thus, the
execution of the preferred embodiment is optimal when
the N × N size of the array is approximately twice the
value of the number of processing units.

As an example, for a computer system with four
processing units, the recommended number of partitions
for each of read-only and read-write memory is three.
This, of course, corresponds to a 3 × 3 array of nine
regions. For computer systems with eight and twelve
processing units, the recommended number of partitions
for each of read-only and read-write memory is
four and five, respectively. As still another example, the
recommended number of partitions for a computer system
with forty processing units is nine.

The greater the number of partitions used in the preferred
embodiment, the more choices there are for the
merging of process sets. However, the greater the number
of partitions, the longer the execution time. The
selection of the number of processing units, of course,
depends upon the available hardware resources and
economic constraints. The selection of the number of
partitions depends upon the importance of turnaround
time in the particular operating environment. Once the
number of processing units and the number of partitions
have been chosen, the time required to execute the load
balancing method of the present invention depends only
thereon.

As hereinbefore described, the initialization procedure
generates at least one process set for each nonzero
region of the load array. Accordingly, if none of the N
partitions has a zero load, the maximum number of
process sets is N². However, when the loads of the
partitions are unequal, some of the partitions may contain
multiple process sets. Since the balance load is
calculated using the number of processing units, P, as
the divisor, it is clear that the worst distribution would
generate P - 1 remaining process sets. Stated alternatively,
the worst distribution, i.e., the distribution with
the largest result, would be found by taking the modulus
with respect to P. Such distribution is P - 1. Anything
larger yields the same or a smaller number of process
sets with a larger balance load. Hence, the maximum
number of initial process sets is N² + P - 1.

The novel reduction phase of the preferred embodiment
reduces the initial list of process sets to the number
of processing units. Since the initial list cannot be longer
than N² + P - 1, the number of merges is represented by
(N² + P - 1) - P or N² - 1. As hereinbefore described in
detail, each merge involves searching all other process
sets, which is limited by the length of N² + P - 1. It is
thus clear that the total time necessary to reduce the list
of process sets to length P is (N² - 1) × (N² + P - 1).

Now consider the time necessary to reset the list of
process sets after each iteration. The maximum length
of the list is N² + P - 1. To arrive at the maximum number
of iterations, it should be clear that an iteration with
either the lowest partition threshold of one or the highest
partition threshold of 2N is unnecessary. This is, of
course, because the number of partitions must be at least
two and the total number of partitions is 2N. Hence, the
maximum number of iterations is 2N - 2 and the total
time required for the balancing iteration is obtained
thus:

[(N² - 1) × (N² + P - 1) + (N² + P - 1)] × (2N - 2)
= (N²) × (N² + P - 1) × (2N - 2)

which is clearly a function of only the number of partitions,
N, and processing units, P. The present invention
has the important advantage that regardless of the size
of the network, the execution time required to balance
the loads during the merger phase remains constant for
a given multiprocessor configuration.

Empirical results indicate that the execution time of
the preferred embodiment is significantly below the
(N²) × (N² + P - 1) × (2N - 2) maximum. This is because
the number of iterations actually required to achieve a
balance load is typically less than 2N - 2. Most configurations
are indeed balanced within only one third of
the maximum possible iterations. That is, within one
third of the maximum iterations, a sufficiently low partition
threshold is achieved whereby further iterations
are unnecessary.

For example, if a fourth iteration were to yield a
partition threshold of six, continuing past the fifth iteration
would be fruitless because a lower threshold value
would be impossible. Furthermore, an iteration is
aborted by the preferred embodiment if, during execution,
the partition threshold is raised above the current
best value. The partition threshold represents the extent
to which memory will be prerequisite for the execution
of the computational load by the processing units.
It should be clear that memory saved by the present
invention is measured by the final partition threshold.
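The arithmetic in the preceding discussion can be checked with a few lines of Python. The function names are hypothetical. `recommended_partitions` assumes the recommendation means the smallest integer not less than the square root of 2P, which agrees with the quoted examples for four, eight, twelve and forty processing units. `memory_savings` assumes savings are taken as one minus the ratio of the final partition threshold to the total number of partitions, 2N.

```python
import math


def recommended_partitions(p: int) -> int:
    """Smallest integer N with N*N >= 2*p (assumed reading of the rule)."""
    return math.isqrt(2 * p - 1) + 1  # integer ceiling of sqrt(2p) for p >= 1


def worst_case_steps(n: int, p: int) -> int:
    """Upper bound N^2 * (N^2 + P - 1) * (2N - 2) on merge-phase work."""
    reduce_steps = (n * n - 1) * (n * n + p - 1)  # merges times search length
    reset_steps = n * n + p - 1                   # resetting the list once
    return (reduce_steps + reset_steps) * (2 * n - 2)


def memory_savings(final_threshold: int, n: int) -> float:
    """Fraction of the 2N partitions a processing unit does not need."""
    return 1 - final_threshold / (2 * n)
```

For forty processing units and nine partitions per memory type, the bound evaluates to 155,520 elementary steps regardless of how large the network is.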
As an illustration, consider a computer configuration
for which each of read-only and read-write memory are
subdivided into ten partitions. If the final
threshold achieved is five, and since total memory
consists of twenty partitions, then the memory savings is
calculated thus: 1 - 5/20 = 75%. Instead of each
processing unit requiring up to all of memory for execution
of its allocated computational load, only a fraction
of the load is borne by each processing unit in no
more than five partitions of memory.

In approximately fifty test cases of typical artificial
neural networks the preferred embodiment yielded
memory savings of sixty to seventy five percent of the
runs. Most of the configurations in these test cases
consisted of forty processing units and twenty total
partitions of memory, i.e., ten partitions of read-only and ten
partitions of read-write memory. For those cases in
which an optimal solution could be independently
obtained via the enumeration method, the load balancing
method of the preferred embodiment deviated only
fifteen percent from the optimal solution. Accordingly,
the present invention enables networks two to three
times larger than would otherwise be possible to be
executed on a computer system.

As hereinbefore described, the execution time of the
present invention increases with the number of partitions.
Yielding a more optimal reduction in memory
with larger number of partitions, the present invention
affords a trade-off between turnaround time and memory
reduction. As an example, consider the empirical
results obtained from runs of an artificial neural network
with 160 nodes and 6,000 connections operating
on a computer system with forty parallel processing
units. In the run in which memory was subdivided into
ten partitions, memory reduction of 74% was achieved
by the preferred embodiment in 7.5 seconds. By contrast,
in the run in which memory was subdivided into
twice as many partitions, memory reduction of 87%
was achieved in 4 minutes, 7 seconds. Thus, an increased
memory reduction of 17% was obtained by
increasing the number of partitions from ten to twenty,
but the turnaround time was increased by several orders
of magnitude.

Other variations and modifications will, of course,
become apparent from a consideration of the features
and steps hereinbefore described and depicted. Accordingly,
it should be clearly understood that the present
invention is not intended to be limited by the particular
features and steps hereinbefore described and depicted
in the accompanying drawings, but that the concept of
the present invention is to be measured by the scope of
the appended claims herein.

I claim:

1. In a parallel processing computer system including
a plurality of processing units and shared memory, and
containing a network having identical computations to
be executed at each connection therein, and said network
further having a constant aggregate computational
load, a method of up-front, load balancing comprising
generating a first signal representative of said plurality
of processing units in said network,
generating a first plurality of signals with each such
signal representing a corresponding one of said
identical computations in said network,
generating in response to said first signal and said first
plurality of signals a second signal representing a
balance load in said network,
generating a second plurality of signals functionally
establishing a preselected plurality of partitions of
said memory,
generating in response to said first and second pluralities
of signals, a first sequence of signals functionally
dividing said computational load into a plurality
of process sets,
generating in response to said second signal and said
first sequence of signals, a second sequence of signals
functionally allocating said process sets among
said memory partitions, and
generating in response to said second sequence of
signals, a third sequence of signals functionally
merging said process sets until they are equal in
number to said plurality of processing units.

2. The method described in claim 1, wherein said
merging of said process sets includes comparing each
signal of said third sequence of signals to said second
signal to detect whether said balance load is exceeded.

3. The method described in claim 2 wherein said
merging of said process sets includes generating a third
signal functionally related to a preselected partition
threshold.

4. The method described in claim 3, wherein said
merging of said process sets includes generating in
response to said third signal, a fourth signal functionally
related to a current partition threshold.

5. The method described in claim 4, where said merging
of said process sets further includes generating a
fifth signal functionally related to the difference between
said third signal and said fourth signal to detect
whether said preselected threshold differs
from said current partition threshold.

6. The method described in claim 5, wherein said
merging of said process sets includes increasing in
magnitude said fourth signal by an amount functionally
related to incrementing said current partition threshold
by a preselected amount.

7. The method described in claim 1 including generating
a fourth sequence of signals functionally ordering
said plurality of process sets by ascending computational
load.

8. The method described in claim 7 further including
comparing said fourth sequence of signals to said third
and fourth pluralities of signals functionally ascertaining
which computational loads to merge together.

* * * * *
