Dartmouth College

Dartmouth Digital Commons
Computer Science Technical Reports

Computer Science

1-22-1997

Automated Parallelization of Discrete State-space Generation
David M. Nicol
Dartmouth College

Gianfranco Ciardo
College of William and Mary


Dartmouth Digital Commons Citation
Nicol, David M. and Ciardo, Gianfranco, "Automated Parallelization of Discrete State-space Generation"
(1997). Computer Science Technical Report PCS-TR97-304. https://digitalcommons.dartmouth.edu/cs_tr/144


Dartmouth College Department of Computer Science Technical Report PCS-TR97-304
Automated Parallelization of Discrete State-space Generation
David M. Nicol, Dartmouth College, Hanover, NH 03755
Gianfranco Ciardo, College of William and Mary, Williamsburg, VA 23185
January 22, 1997

Abstract
We consider the problem of generating a large state-space in a distributed fashion. Unlike
previously proposed solutions that partition the set of reachable states according to a hashing
function provided by the user, we explore heuristic methods that completely automate the
process. The first step is an initial random walk through the state space to initialize a search
tree, duplicated in each processor. Then, the reachability graph is built in a distributed way,
using the search tree to assign each newly found state to classes assigned to the available
processors. Furthermore, we explore two remapping criteria that attempt to balance memory
usage or future workload, respectively. We show how the cost of computing the global snapshot
required for remapping will scale up for system sizes in the foreseeable future. An extensive set
of results is presented to support our conclusions that remapping is extremely beneficial.

1 Introduction
Discrete systems are frequently analyzed by generating and examining their underlying state-space.
While a valuable technique, its most serious weakness is that large complex systems may require
the generation and analysis of tremendously large state-spaces. There is relatively little locality
of reference in the generation process, which means that ordinary virtual memory systems thrash
once the generated state-space exceeds the available main memory. We are interested in extending
the size of state-spaces amenable to generation by exploiting the aggregate memory available in a
distributed-memory computer system. In earlier work we showed how this could be accomplished
using a user-defined hash-function that statically partitions the graph among processors [2]. Given
a generated state (typically a vector of integers), the hash-function identified the processor to which
the state was statically assigned. While we showed that a well-tuned hash-function can effectively
balance the workload partition and achieve reasonable execution time efficiencies as well, the method
suffers from some obvious drawbacks. The user must have a thorough understanding of how the
state-space will develop in order to craft a hash-function that both balances the graph partition
and keeps the number of graph edges that cross processor boundaries low. If ever in the course of
generation just one processor is so burdened with states that it exhausts its available memory, the
whole computation fails. If in the course of generating the graph it transpires that the generation
process is not exploiting much parallelism, nothing can be done. Finally, the requirement of a user-defined
function for every different modeled system makes development of a general distributed
graph generation tool problematic.

Footnote: This work was supported in part by NSF grants CCR-9201195 and NCR-9527163, in part by a subcontract on the
grant 96-SC-NSF-1011 to the Center for Advanced Computing and Communication, and in part by NASA Contract
S1-19480 to the Institute for Computer Applications in Science and Engineering.

In this paper we develop techniques that automate parallelization of state-space generation.
Our methods include innovations in the mapping of states to processors, and policies for remapping
workload. Our methods have been embedded in a general distributed state-space generation tool
which itself has been integrated as the back-end of a stochastic Petri net modeling tool, spnp[4].
We examine the utility and costs of dynamic remapping for the tool executing on an IBM SP-2,
considering effectiveness in both memory balancing and run-time parallelism. We find that our
automated methods are much better than optimally hand-tuned static methods with respect to
both memory balance and speedup.
The remainder of this paper is organized as follows. Section 2 provides background information
on the mechanics of state-space generation and its distributed implementation, and sketches related
work. Section 3 describes our method for automatically assigning states to processors. Section 4
then focuses on remapping. Section 5 describes some essential details regarding message-passing
in our implementation. Section 6 reports on experiments we conducted, Section 7 develops some
analytic results related to this work, and Section 8 gives our conclusions.

2 Background
Many discrete systems are analyzed formally by considering their full underlying state-space. Typically a "state" of the system is a vector of integers. For instance, in a queueing network the state
is a vector of the queue lengths, one component per queue; in a Petri net the state is a vector
of token counts, one component per place in the Petri net; in a system of communicating finite
state machines the vector contains components describing the internal states of those machines.
From a state, it is usually possible to transition to other states. A transition may be caused by an
external arrival to the system, or by the internal stochastic evolution rules of the model. In both
queueing networks and Petri nets, for instance, the system remains in a state s for a random period
of time and then chooses some new state s' that is reachable from s; a job moving from one server
to another changes the vector of queue lengths describing a queueing network, and a transition firing
changes the marking of a Petri net. In the following, by "state-space" we mean the directed graph
whose nodes are the reachable states and whose edges describe transitions between states, the edge
being directed towards the state resulting from the transition represented by the edge.

Figure 1: Organization of explored, unexplored, and found states. (Diagram labels: first state; search tree; explored states; next state to explore; position of next unexplored state found; found states; unexplored states.)

The state space for a system may be infinite: a single queue with infinite buffering can conceivably build up beyond any pre-specified queue length given a rapid enough set of arrivals. We
concern ourselves here with the problem of generating finite state-spaces, although the methods
will work equally well for infinite spaces if one includes a rule for terminating their expansion.
State spaces are typically built using breadth-first generation. Initially, some starting state is
placed on the list of "unexplored states". The starting state is also used to initialize a tree that
will organize the "found (up to that point) states" for an efficient search. The generation process
then enters a loop that terminates only when the list of unexplored states is empty. At each loop
iteration, the first unexplored state is removed and added to the list of "explored states", and (using
rules specific to the problem domain) all states reachable from it in one transition (i.e., children)
are generated. Each of them is then searched for in the tree; if a child is found to have been generated
already, then only a new graph edge is introduced, rooted in the child's parent and directed towards
the existing node. If a child is not found in the tree, then it is first inserted into the tree, before
adding a state-space edge directed to it from its parent, and the child is placed at the end of the list
of unexplored states. Figure 1 illustrates the relationship between explored states, the next state
to explore, and the next location to place a newly generated state. The search tree is extended and
balanced through manipulation of pointers; the states themselves do not move within the illustrated
layout. We stress that the order in which states are found is in no way related to the lexicographic
order used in the search tree.
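
The serial generation loop just described can be sketched as follows. This is a minimal illustration and not the tool's implementation: a Python dictionary stands in for the balanced search tree, and the function names and the two-queue model used to exercise the loop are hypothetical.

```python
from collections import deque

def generate_state_space(initial, successors):
    """Breadth-first generation of the reachable states and their edges."""
    index = {initial: 0}     # found states -> position (stand-in for the search tree)
    states = [initial]       # found states, kept in discovery order
    unexplored = deque([0])  # positions of states found but not yet explored
    edges = []               # (parent position, child position) pairs
    while unexplored:
        parent = unexplored.popleft()
        for child in successors(states[parent]):
            pos = index.get(child)
            if pos is None:              # not seen before: record it and queue it
                pos = len(states)
                index[child] = pos
                states.append(child)
                unexplored.append(pos)
            edges.append((parent, pos))  # edge directed towards the child
    return states, edges

def two_queue_successors(state):
    """Hypothetical model: jobs circulate between two queues, one move per transition."""
    a, b = state
    children = []
    if a > 0:
        children.append((a - 1, b + 1))
    if b > 0:
        children.append((a + 1, b - 1))
    return children

states, edges = generate_state_space((2, 0), two_queue_successors)
```

States never move once appended to the list; only the lookup structure and the queue of unexplored positions change, which mirrors the layout of Figure 1.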
The application-specific part of state-space generation can be separated from the common aspects through a well-defined interface, such as that we describe in [2]. One benefit is the insulation of the system modeler from details concerning parallelization. The techniques we describe in
this paper follow in a similar vein and have been implemented in a manner that supports integration
with any modeling front-end that correctly uses the interface.
Different analyses are conducted once the graph is generated, depending on the application. In
models of communication protocols one looks to see whether any of the reachable states represents
deadlock. Livelock is also possible and is detected by identifying transient states (ones that may
become unreachable once left). Resource requirements can be extracted from an understanding
of the relationship between state expression and resources. Parallel algorithms for performing
these types of analyses are discussed in our earlier paper [2]. When the state-space represents
an ergodic continuous-time Markov chain one typically solves a set of linear equations (the global
flow-balance equations [22]) that associate a stationary probability with each node in the graph.
Methods for solving linear equations in parallel are standard and will not be discussed here except
to say that good performance can be expected if the graph clusters well and the states are evenly
balanced among processors. Indeed, we anticipate that Markov analysis will be one of the principal
applications for our methods; our emphasis on balancing memory utilization derives in part from
our recognition that balanced memory is required to balance the workload of the linear system
solution phase.
One aspect of distributed state-space generation sets it apart from other types of computations
that are usually distributed: the need to determine whether a newly generated state has been
generated before. Discovering that state B is a child of state A establishes a unique and previously unknown link between A and B, but state B may also be a child of state C, and so may
already be represented somewhere in the system. This is one of the key differences between serial
and distributed state-space generation; a newly generated state must be sent to some processor
responsible for that state, to determine whether this is a new or repeated instance of the state.
Maintaining a global directory of states is not practical; instead we must use the state description
itself to determine its location. Our previous approach was to hash the state vector into the range
of integer processor identifiers, using a user-defined hash function. When a generated state maps
to a different processor, the state and edge information is sent to the correct location.
As automatic parallelization cannot depend on a user-defined hash function, a different approach
is needed. We see two requirements. First, the location of any state must be determined quickly,
without interprocessor communication. Furthermore, we must map states so that a child state
is likely to be mapped to the same processor as its parent, otherwise we will be overwhelmed by
inter-processor communication. Second, we need to support dynamic remapping of workload, and
descriptions of known states. The remapping problem is hampered by the fact that future execution requirements are unknown, and unknowable, because the structure of the as-yet-undiscovered
portion of the state-space is not known.
Another dimension of the remapping problem is to determine what to remap, how to remap,
and when to remap. These issues are developed in the following sections, after we discuss related
work.
Early work on load-sharing had a world view of independent jobs in a distributed system.

The main interest was in job migration policies [23, 6, 13] that best utilized system resources.
The algorithms were asynchronous: workload moved without global synchronization and without a
global view of the system; key issues were how transfers are initiated, and what information is used
to govern those initiations. More recent work has a different view of workload, but has continued
in the asynchronous vein [24, 12]. A synchronous view of remapping was taken in work on decision
policies that focus on when to remap [18, 19, 20]; balancing the delay cost of remapping against
the anticipated performance gain is the essence of these policies. Globally synchronous remapping
techniques are developed in [5, 25]. These methods iterate; at each iteration, pairs of processors
balance workload between them, and ultimately some global sense of balance is achieved without
any of the processors ever having had a global view of the system.
Our application and parallelized branch-and-bound computations share the problem that at a
point when one remaps, future workload is unknown. Various methods for approaching parallel
branch-and-bound have been studied [11, 10, 14, 21]. A critical difference between our application
and branch-and-bound is that we must store the state-space we explore, which is not typically done
in branch-and-bound. A branch-and-bound algorithm may generate a given subproblem (and then
all of its descendants) multiple times; we do not. Unlike branch-and-bound, when we generate a
state, we must determine whether that state has been generated before. We have semantic reasons
for achieving some level of locality between our states; branch-and-bound is not so constrained.
Our approach is to exploit the high connectivity that modern interconnection networks provide,
and compute remappings synchronously, and directly. Each processor has a picture of the global
load distribution, and when implementing a remapping, a piece of workload or data goes directly
from its source processor to its target processor. Direct methods have been considered in the
context of large scale parallel processing systems [7, 1, 15], but these do not take a global view.
While our approach is pragmatic and driven by available technology, we will argue, at the end of
Section 7, that it scales up to moderately large parallel systems and so has some future value as
well.

3 Automated State Mapping
To help understand the intuition behind our methods we first briefly review the mechanics of
balanced trees. As our implementation uses AVL trees [9], our description applies specifically
to these, as well as certain other commonly used search trees. One uses an application-specific
comparison operator such that for any state vectors A and B, either A < B, A > B, or A = B.
Given a tree and some state vector C, we determine whether C is in the tree by comparing it with
the tree's root, R. If C = R we are done; otherwise we recursively search the left subtree of R if
C < R, and the right subtree otherwise. If we find that C is not in the tree, it is inserted into
the tree at the "exit point" of the last comparison. That is, if state vector B was the tree member
against which C was last compared, then B becomes C's parent in the tree; C is B's left child if
C < B and is its right child otherwise. The tree may balance itself following the insertion. We will
say that a state in the tree has a left exit point if it has no left child; a right exit point is similarly
defined.

Figure 2: Balanced tree and definition of classes via exit points. (The tree holds the states (3,1,0,0), (3,0,1,0), (3,0,0,1), and (4,0,0,0); its exit points are labeled class 1 through class 5.)

For state-space generation with vector-valued states, it is typical to use lexicographical ordering
to compare two states. If $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$, then one determines the
ordering exactly as one would if A and B were n-digit numbers, scanning from left to right looking
for the first coordinate position in which A and B differ, using the comparison of the values in that
position to determine the vector ordering.
We replace the hash function of our earlier work with a classifier based on balanced tree mechanics. We will map (and remap) classes, not individual states. Classes are defined with respect
to a control set of states organized as a search tree. The same control set is built by each processor
during an initialization phase, and does not change throughout the subsequent graph generation
phase. The exit points of its tree are enumerated. Any state found during the breadth-first exploration is classified by searching for it in the control set's tree. If found in the control set, it is
considered to be in class 0. Otherwise, the identity of its exit point defines its class, and its current
processor is retrieved from a class-to-processor lookup table. Figure 2 illustrates a control set tree
with four nodes and the induced five class definitions.
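
A minimal sketch of this classifier follows, under a simplifying assumption: a failed search in a binary search tree always exits in the gap between two in-order neighbours, so a sorted list searched with `bisect_left` induces the same partition as the control set's tree. Numbering the classes left to right by gap, the function names, and the contents of the lookup table are assumptions made for illustration; the control states are those shown in Figure 2.

```python
from bisect import bisect_left

def classify(state, control_sorted):
    """Return 0 for a control-set member, else 1 + index of the gap (exit point)."""
    i = bisect_left(control_sorted, state)
    if i < len(control_sorted) and control_sorted[i] == state:
        return 0
    return i + 1

# Control set of Figure 2; Python tuples compare lexicographically.
control = sorted([(3, 1, 0, 0), (3, 0, 1, 0), (3, 0, 0, 1), (4, 0, 0, 0)])

# Hypothetical class-to-processor table, replicated on every processor.
class_to_processor = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}

def owner(state):
    """Locate a state's processor without any interprocessor communication."""
    return class_to_processor[classify(state, control)]
```

Because remapping only rewrites entries of the lookup table, the classification of a state never changes during generation; only its owner does.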
To keep communication costs down, it is desirable that a child and parent frequently map to
the same class. At the same time, we need states to be spread among classes to give us flexibility
in remapping. Our choice of control set and its organization attempts to achieve these conflicting
goals. A poor choice for the control set might map nearly all states into a very small number of
classes (a fact we discovered the hard way); observe that, in Figure 2, any state with a 0, 1, or
2 as its first component is mapped into class 1. Understanding the tree search as being nearly
equivalent to a hashing function, we see that the control set needs to capture characteristics of the
whole state-space in the same way as a hashing function does. To acquire these characteristics we
create the control set from random walks.
At initialization, each processor generates a number of random walks through the state-space.
States hit by a random walk are saved. Each random walk begins at the initial state; given the
last state visited by the walk, its children are generated. One of the children is selected uniformly
at random to be the next state in the random walk and its siblings are discarded (see footnote 1).
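
One such walk can be sketched as below. The walk length, the handling of states without children, and the toy successor function are assumptions for illustration; the text does not specify them.

```python
import random

def random_walk(initial, successors, steps, rng):
    """Return the states hit by one walk that starts at the initial state."""
    visited = [initial]
    state = initial
    for _ in range(steps):
        children = successors(state)
        if not children:              # dead end: stop this walk early (assumption)
            break
        state = rng.choice(children)  # uniform choice; the siblings are discarded
        visited.append(state)
    return visited

def counter_successors(state):
    """Hypothetical model: a counter that moves up or down by one within 0..5."""
    return [s for s in (state - 1, state + 1) if 0 <= s <= 5]

walk = random_walk(0, counter_successors, steps=10, rng=random.Random(1))
```

The saved states from all walks on all processors would then be merged to form the control set.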
To further avoid any systematic over-clustering of states, we randomize the lexicographical
ordering. Using synchronized random number generator seeds, each processor generates the same
permutation vector idx[] to randomize the ordering of state components. Thus, idx[0] holds
the index of the first state component examined in the lexicographical comparison, idx[1] holds
the second index examined, and so on. We found this a valuable tool to protect us from the
potential problems arising from correlation of states' components (consider that hashing on a vector of
components whose values are highly correlated effectively diminishes the "spreading" of the hashing
function; lexicographical ordering is like hashing in this regard). Indeed, the "natural order" in
which the components would be considered if we were not permuting them (e.g., the order in which the places
are specified in a Petri net model) often reflects the modeler's thinking about the system, and it is not
likely to be truly "random". This is a case where a truly uninformed decision is instead desirable.
Indeed, experiments where we generated orderings that encourage locality by placing components
likely to change in a transition at the end of the lexicographical ordering were too successful, in
that they created classes too large to move. Until our mechanism can dynamically split classes
(which we discuss later) it seems that randomization is the most sensible approach.
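
The randomized comparison can be sketched as follows. The vector name `idx` comes from the text; the function names and the use of a seeded `random.Random` to model synchronized generators are assumptions.

```python
import random

def make_permutation(n, seed):
    """Every processor calls this with the same seed and so obtains the same idx."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return idx

def compare(a, b, idx):
    """Lexicographic comparison that examines components in the order given by idx."""
    for i in idx:
        if a[i] != b[i]:
            return -1 if a[i] < b[i] else 1
    return 0

def sort_key(state, idx):
    """Equivalent key: reorder the components, then compare tuples as usual."""
    return tuple(state[i] for i in idx)
```

Sorting with `sort_key` gives the same order as `compare`, which is convenient when the control set is kept in a sorted container.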
Despite the extensive randomization, this method of class definition will induce a certain level
of locality among states in a common class. For a moderately sized control class, class definition
is likely to be entirely determined by comparison with indices early in the comparison sequence.
There is a reasonable chance that a child will reside in the same class as its parent for the simple
reason that the difference between the parent and child vectors is usually restricted to a few state
components, and if classes are determined by entirely different state components, the child and
parent will be in the same class. We formalize this intuition in Section 7 where we derive a lower
bound on the probability that a child resides in the same class as its parent.
The processors collectively merge all saved states on all processors to define the control set, and
build a tree over this set using the randomized lexicographical comparison function. Each processor
then traverses the tree, evenly partitioning the member states among processors for exploration and
evenly partitioning the classes among processors to build the first class-to-processor mapping table.
No communication is involved: each processor has the same control set organized the same way, and
it executes the same algorithm as all the others to determine which states seed its exploration
list, as a function of its processor identifier. The processors synchronize globally to terminate the
initialization phase, and then begin the distributed generation.
A class-based mapping approach requires modifying the internal organization. A single tree per
processor is inadequate, since it makes it inefficient to identify members of a remapped class and
remove them. Instead we use an independent search tree for each class. We cannot pre-allocate
memory for classes, so we partition memory into blocks of contiguous states that are allocated
individually to processors.

Footnote 1: We stress that if the graph represents a Markov chain, this differs from the probability that a given child is
actually reached: we are now interested in strictly structural properties of the state-space, not in the stochastic and
timing behavior of the graph.

Figure 3: Block-oriented organization of class-grouped states in a given processor, assuming that
it is assigned classes 1 and 2. (Diagram labels: list of found states in class 1; list of found states in class 2; list of blocks with unexplored states; explored and unexplored states in class 1; explored and unexplored states in class 2; unused.)

The overall approach to distributed generation differs slightly from the hash-function based
approach. First, the mechanism for identifying a state's location is different, as we have described.
Second, our internal method for uniquely identifying a state (for the purposes of edge description) is
different. Earlier, the source of an edge was described by a counter-processor pair (e.g., the 12664th
state on processor 23); since states remap, we now use a counter-class pair. A third difference is
management of state exploration. Despite the partitioning into state-blocks, the states associated
with a class can still be viewed as being ordered (logically) contiguously with a pointer to the
next state to explore and a pointer to the next place to insert a new state. When new memory is
assigned to a class, an entire new block is assigned, and the block is filled sequentially. All state
blocks of all classes holding unexplored states are on a singly linked list; whenever a new state block
is allocated it is attached to the end of this list. The state expansion loop works off the linked list,
exploring each state in the head block until exhausted, after which the head block is removed, and
processing is applied to the new head block.
When a processor exhausts its list of explorable states it engages in distributed termination
logic. Of course, it may be that there is still ample workload elsewhere in the system, some of
which will be remapped at the next remapping epoch. There are a variety of methods one might
use to detect termination; as long as its cost is not intrusive, the choice matters little. We use
the non-committal barrier synchronization [16] as that was easy to integrate into the distributed

generation logic.
The state-classification method may be extended to allow the control set to grow dynamically,
an extension we hope to investigate in the near future. If it transpires that too many states are
mapped to a class, one can "split" the class in an intuitive way. Since the class is organized as a
balanced tree, the root of that tree is promoted to the control class, and the left and right subtrees of
that root define new classes. The two new classes have nearly equal sizes since the tree is balanced.
The promotion may cause the control set to be rebalanced, but such a balancing (at least with
tree semantics) does not change any previously defined classes, and one is assured that the newly
promoted state remains a leaf node in the control set.
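
The proposed split can be illustrated with a small sketch. It substitutes a sorted list for the class's balanced tree and uses the median element as a stand-in for the tree's root; an AVL root is only near the median, so the sizes below are idealized. The function name is hypothetical.

```python
def split_class(members):
    """Split one class into (promoted state, left class, right class)."""
    ordered = sorted(members)
    mid = len(ordered) // 2      # median as a stand-in for the balanced tree's root
    promoted = ordered[mid]      # this state would join the control set
    return promoted, ordered[:mid], ordered[mid + 1:]

promoted, left, right = split_class([(2, 1), (0, 3), (1, 2), (3, 0), (1, 0)])
```

Every state in the left class compares below the promoted state and every state in the right class compares above it, so a later search of the enlarged control set separates the two new classes with one extra comparison.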

4 Remapping
Remapping requires policies governing when to remap, how to remap, and what to remap. The latter
consideration depends on one's objectives: a policy optimized to balance memory utilization is
different from one optimized to reduce execution time, which is different from one optimized to
remap only when one processor is in danger of exhausting its memory space. We will consider
remapping all classes, based on known size (to balance memory utilization), and remapping only
classes with unexplored states (to balance anticipated future execution costs); in the following
discussion we simply speak of a processor's \load" with the understanding that either notion applies.
A variety of remapping problems have been explored in the literature; see [26] for a survey. One
important characteristic is whether the method is synchronous or asynchronous. We are driven
towards a synchronous approach because of the overwhelming problem of keeping state location
information up-to-date. At any time, any processor might generate any state, and need to send
it anywhere. We did not wish to deal with the problems an asynchronous method would present
with respect to mapping information. We are also driven towards a "direct" approach to remapping, where a source processor directly sends workload to the processor that ultimately receives it,
without consideration of the communication network topology. This stands in contrast to iterative
approaches where a piece of load may migrate several times in the course of a remapping. We are
pragmatists; the machines we have available to us are moderately sized, and some of them have
very high-bandwidth networks. Furthermore, on shared multiprocessors the subset of nodes (and
the communication topology induced by the subset) upon which the program is executed varies
from run to run. In this context it makes sense to directly compute remappings using globally
disseminated load information, and to eschew any dependence on underlying communication topology. The mechanisms we use to accomplish the global dissemination are standard vector-valued
reduction primitives that have long been found in message-passing libraries such as the Intel nx
library, and MPI [8]. Only experimentation on very large parallel computers will definitively answer
whether such methods are appropriate on such machines.
We formerly investigated two styles of remapping decision policies. Papers [19] and [20] balance

the delay cost of remapping against the performance degradation due to load imbalance and global
blocking at the problem's natural synchronization points. This policy focuses entirely on minimizing
execution time, and we saw no clean way of extending it to encompass memory balance. Policies
that weigh the cost of remapping against the anticipated benefit of remapping over the remaining
lifetime of the computation are investigated in [18]. These policies principally account for the
error that is natural in statistical measurements. On reflection we realized that the decision of
whether to remap is fundamentally a choice of remapping frequency. Graph generation can be a
long-lived computation with load constantly drifting out of balance. The question is not whether to
remap, but how often to remap. Before inventing a policy for automating the remapping frequency
we will experimentally vary it to determine how performance depends upon it. When we do
balance, we attempt to balance nearly perfectly. There is compelling theoretical evidence that
in a computation of this sort we should attempt to redress any imbalance; the model studied in
[17] shows that under workload growth not dissimilar to ours, the maximum memory use by any
processor is stochastically smaller the closer to perfect balance we can achieve with a remapping.
However, as a means of controlling remapping overhead, we will provide mechanisms that back
away from this most aggressive form of balancing, if needed.
Our first problem is to get processors coordinated so that they periodically remap. Unlike many synchronous numerical methods, this application has no natural synchronization points (e.g.,
convergence checks). We create synchronization points with real-time clocks. We initially synchronize processors' clocks and have each processor establish a Unix signal generator that causes
a timer interrupt at a fixed interval of real time, the remapping interval. The interrupt handler merely increments a counter;
the graph generation code tests for changes in this counter before processing any new state (and
in other critical places), and upon detecting a change engages immediately in a parallel reduction
that serves both to synchronize the processors and to distribute load information. The remapping interval is obviously a
key variable; excessive overhead results from too small a value, while processor idleness may result
from too large a value. Our experiments look at the effects of changing this interval.
We use a global vector OR reduction to synchronize processors. Each processor offers a vector
with one component per processor, zero-filled except for the processor's load value placed in its
own component. All processors gain the complete load vector and then exercise the same logic to
determine what to do: send load, receive load, or do nothing. Each processor begins by computing
the average load. Next it rescans the load vector and defines the on-load group of processors
whose loads are less than the average; the off-load group is formed by processors with load greater than the average.
To determine how to remap, each processor sorts the off-load group into decreasing load order,
and sorts the on-load group into increasing load order. Thus the off-load group is arranged with
the most able to contribute load at the front, while the on-load group is arranged with those most
in need of states at the front. The ith member of the sorted on-load group is paired with the ith
member of the sorted off-load group; if the on-load group is larger, we repeat the off-load group as
needed. This is illustrated in Figure 4; an off-load processor may need to distribute load to more
than one other processor.

Off-load group (decreasing load, repeated as needed): processor 5 (load 210), 2 (150), 6 (122), 5 (210), 2 (150)
On-load group (increasing load): processor 0 (load 60), 1 (73), 7 (85), 4 (98), 3 (102)
Figure 4: Pairing of source and destination processors
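
The grouping and pairing logic can be sketched as below, using the loads of Figure 4. The function name is hypothetical, and the sketch assumes that off-load processors beyond the size of the on-load group are simply left unpaired, a case the text does not discuss.

```python
def pair_processors(loads):
    """loads: {processor: load}. Returns (source, destination) pairs."""
    avg = sum(loads.values()) / len(loads)
    # Off-load group: above-average processors, heaviest first.
    off = sorted((p for p in loads if loads[p] > avg), key=lambda p: -loads[p])
    # On-load group: below-average processors, neediest first.
    on = sorted((p for p in loads if loads[p] < avg), key=lambda p: loads[p])
    if not off:
        return []
    # The i-th on-load processor is served by the i-th off-load processor,
    # cycling through the off-load group when the on-load group is larger.
    return [(off[i % len(off)], dst) for i, dst in enumerate(on)]

figure4_loads = {0: 60, 1: 73, 2: 150, 3: 102, 4: 98, 5: 210, 6: 122, 7: 85}
pairs = pair_processors(figure4_loads)
```

Since every processor holds the same load vector after the reduction, every processor computes the same pairing without further communication.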
Each off-load processor attempts to bring every one of its assigned processors up to (but not
beyond) the average load without causing its own load to drop below the average and without spending too much time in
transferring remapped classes. The latter constraint is enforced by computing at initialization the
maximum number of states that a processor may send within a time period equal to 10% of the interval between remappings. This
calculation requires some parametric description of message sending costs (and 10% is admittedly
arbitrary), but experience with this check has proven its utility in backing away from performance-crippling
remapping schedules.
More specifically, an off-load processor creates a list of classes available to move, sorted in decreasing
order of load. It then enters a loop in which, at each iteration, it attempts to assign
one class to each of its on-load assignments, scanning those on-load assignments in most-needy-first order. Each assignment attempt scans the list of available classes from front to back,
choosing the largest as-yet-unassigned class that neither brings the target processor's load above
the average, decreases the source processor's load below the average, nor increases the source processor's estimated
remapping overhead beyond the limiting threshold. The loop exits once no further assignments
are possible.
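The loop described above might look roughly like the following Python sketch. It is written under stated assumptions rather than taken from the report: the function name `plan_transfers`, its parameters, and the example class names and loads are all hypothetical, the transfer-time limit is modeled simply as a cap `max_states` on the total load sent, and the targets are re-sorted by current load on every pass.

```python
def plan_transfers(source_load, classes, targets, average, max_states):
    """Greedy sketch of one off-load processor's transfer plan.

    classes:    {class_id: load} for the classes this processor may move.
    targets:    {processor_id: load} for its paired on-load processors.
    max_states: cap on the total load sent (stands in for the time limit).
    Returns a list of (class_id, target_processor) assignments.
    """
    # Largest classes first.
    available = sorted(classes, key=lambda c: classes[c], reverse=True)
    target_load = dict(targets)
    sent = 0
    plan = []
    progress = True
    while progress:
        progress = False
        # Visit targets most-needy-first (lowest current load first).
        for t in sorted(target_load, key=lambda p: target_load[p]):
            for c in available:
                size = classes[c]
                if (target_load[t] + size <= average
                        and source_load - size >= average
                        and sent + size <= max_states):
                    plan.append((c, t))
                    available.remove(c)
                    target_load[t] += size
                    source_load -= size
                    sent += size
                    progress = True
                    break  # at most one class per target per pass
    return plan


# Hypothetical example: class names and loads are invented for illustration.
plan = plan_transfers(
    source_load=210,
    classes={"A": 40, "B": 30, "C": 20, "D": 10},
    targets={0: 60, 4: 98},
    average=112.5,
    max_states=1000,
)
print(plan)
```

In this example the largest class fits the neediest target, while the second target can only accept the smallest class without exceeding the average.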
This method is purely heuristic. Since the classes have arbitrary sizes and are indivisible,
the problem of finding any assignment that minimizes the makespan is intractable, let alone an
optimal assignment that minimizes communication. We deliberately make no attempt to maximize
on-processor graph edges, say, by considering inter-class connectivity. Our experience with mapping
has always been that simple techniques usually provide most of the benefit possible, and we saw
that measuring inter-class arcs would impose potentially large additional execution and storage
costs. Very little can be said formally about this method; what can be said is that the method is
fast, and makes every effort to redress imbalance in those processors that most sorely need more
load.
Further communication is needed to prepare for the class transfers. The number of classes
each off-load group processor will send to each of its targets is established with another vector OR
reduction, where off-load processors (possibly repeated as illustrated earlier) fill in their components
of a zero-filled vector with these counts. All processors participate in the reduction, and take note
of these counts as a means of interpreting the next communication, where the identities of the classes being
transferred are communicated through another vector OR reduction. The length of this second
vector is the total number of classes being transferred; each off-load processor writes into a zero-filled
vector the identities of the classes it is about to transfer. Indexing is supported by the
earlier reduction that established transferred class counts, as shown in Figure 5, which assumes the
pairing described in Figure 4. Following the reduction, every processor has the information needed
to appropriately modify the class-to-processor mapping table, and does so. Each off-load processor
then packages the classes scheduled for transfer, sends them to their scheduled recipients, and
enters a barrier. The recipients accept all scheduled classes, unpack them and create their data
structures, and then enter a barrier. Any other processor simply enters the barrier. Emerging from
the barrier, the processors continue their state-generation activity until the next remapping check.

OR-reduced vector describing the number of classes between each (Off-load, On-load) pair:
  Pair:      5→0   2→1   6→7   5→4   2→3
  Classes:     3     1     2     2     1

OR-reduced vector describing the identity of each class that will be remapped:
  Pair:       5→0           2→1   6→7      5→4      2→3
  Class IDs:  19, 111, 46   2     18, 79   39, 21   8

Figure 5: Global communication of remapping specifics. The example assumes the pairing of processors
illustrated in Figure 4.
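The two reductions can be illustrated with a small single-process simulation in Python. This is a sketch under assumptions: the OR reduction is imitated by an element-wise bitwise OR over the vectors each processor would contribute (no MPI call is made), and the names `or_reduce`, `plans`, and `new_owner` are hypothetical. The pairs and class identities are those of Figures 4 and 5.

```python
from functools import reduce


def or_reduce(vectors):
    """Element-wise bitwise OR of equal-length integer vectors."""
    return [reduce(lambda a, b: a | b, column) for column in zip(*vectors)]


# Pair order known to every processor (Figure 4).
pairs = [(5, 0), (2, 1), (6, 7), (5, 4), (2, 3)]
# Classes each off-load processor plans to send, per pair (Figure 5).
plans = {
    5: {(5, 0): [19, 111, 46], (5, 4): [39, 21]},
    2: {(2, 1): [2], (2, 3): [8]},
    6: {(6, 7): [18, 79]},
}

# First reduction: one slot per pair, holding the number of classes to move.
# Processors that send nothing would contribute all-zero vectors.
count_vectors = []
for plan in plans.values():
    v = [0] * len(pairs)
    for pair, ids in plan.items():
        v[pairs.index(pair)] = len(ids)
    count_vectors.append(v)
counts = or_reduce(count_vectors)

# The counts tell every processor where each pair's slots begin.
offsets = [sum(counts[:i]) for i in range(len(counts))]

# Second reduction: one slot per transferred class, holding its identity.
id_vectors = []
for plan in plans.values():
    v = [0] * sum(counts)
    for pair, ids in plan.items():
        start = offsets[pairs.index(pair)]
        v[start:start + len(ids)] = ids
    id_vectors.append(v)
class_ids = or_reduce(id_vectors)

# Every processor can now update its class-to-processor mapping table.
new_owner = {}
for (src, dst), start, n in zip(pairs, offsets, counts):
    for cid in class_ids[start:start + n]:
        new_owner[cid] = dst

print(counts, class_ids)
```

Since each slot is written by exactly one processor and is zero everywhere else, the OR of all contributions reproduces every written value unchanged.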

5 Message-Passing Details
Our application's performance is affected significantly by how certain message-passing details are
handled; because of this dependence we describe those details.
It is still the case on most message-passing systems that the startup cost of sending a message
is large compared with the per-byte transfer cost. State vector lengths tend to be small (< 100
components), at least for systems one can exhaustively analyze. A processor therefore dedicates
an output buffer to each other processor. As states are generated and "sent", they are in fact just
buffered until the buffer becomes full, at which point the set of states is finally sent. A processor
with no further states to explore will preemptively flush its buffers. We use 3000-byte message
buffers.
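A minimal Python sketch of this buffering scheme follows. The class `OutputBuffers`, its method names, and the `send` callback are hypothetical and stand in for the actual message-passing calls; as a simplification, a buffer is flushed when the next state would not fit rather than at the exact moment it becomes full.

```python
class OutputBuffers:
    """Per-destination buffers that batch states into larger messages."""

    def __init__(self, destinations, capacity_bytes, send):
        self.capacity = capacity_bytes
        self.send = send  # hypothetical callback: send(dest, list_of_states)
        self.pending = {d: [] for d in destinations}
        self.used = {d: 0 for d in destinations}

    def add(self, dest, state):
        """Buffer one encoded state (a bytes object) for `dest`."""
        # Flush first if this state would overflow the buffer.
        if self.used[dest] + len(state) > self.capacity:
            self.flush(dest)
        self.pending[dest].append(state)
        self.used[dest] += len(state)

    def flush(self, dest):
        """Send whatever is buffered for `dest`, if anything."""
        if self.pending[dest]:
            self.send(dest, self.pending[dest])
            self.pending[dest] = []
            self.used[dest] = 0

    def flush_all(self):
        """Called when the processor has no further states to explore."""
        for dest in self.pending:
            self.flush(dest)


# Usage: record (destination, number of states) for each message "sent".
sent = []
buffers = OutputBuffers([1, 2], 3000, lambda d, states: sent.append((d, len(states))))
for _ in range(31):
    buffers.add(1, bytes(100))  # 100-byte states, an invented size
buffers.flush_all()
print(sent)
```

With 100-byte states and a 3000-byte buffer, thirty states travel in one message instead of thirty, so the startup cost is paid once per batch.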
Our application runs under MPI, which takes care of all low-level message-passing logic. When
a message arrives off the network, MPI either routes it to a memory block that the application
declared for it, or stores it internally until the message is sought. Deadlock can result if messages
are received faster than the application consumes them. It is necessary to implement flow control
at the application level, and to be sure that the application looks for new messages sufficiently
often. Our code implements flow control with acknowledgements. It allows a processor to send
up to U unacknowledged state messages to each target processor; if it reaches this threshold, the
application will not continue processing workload until an appropriate acknowledgement comes in.
A state message is acknowledged as soon as it is received. The experiments we report use U = 10.
A counter is kept in the state exploration loop, enumerating the explored states. Every fourth
state, the application calls a routine that probes MPI for any messages, and processes all messages
so identified. Relatively frequent absorption of MPI-buffered messages is needed to keep senders
from stalling at the flow-control threshold.
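The acknowledgement window and the periodic probe can be sketched as follows. This Python fragment is illustrative only: `FlowControl`, `explore`, and the `poll` callback are hypothetical names, and the actual MPI probing and message handling are deliberately left out.

```python
class FlowControl:
    """Limits unacknowledged state messages per destination to `window`."""

    def __init__(self, destinations, window=10):
        self.window = window  # U in the text
        self.unacked = {d: 0 for d in destinations}

    def can_send(self, dest):
        return self.unacked[dest] < self.window

    def on_send(self, dest):
        if not self.can_send(dest):
            raise RuntimeError("sender must wait for an acknowledgement")
        self.unacked[dest] += 1

    def on_ack(self, dest):
        self.unacked[dest] -= 1


def explore(states, poll, poll_every=4):
    """Process states, calling `poll` after every `poll_every`-th state."""
    for count, state in enumerate(states, start=1):
        # ... generate and route the children of `state` here ...
        if count % poll_every == 0:
            poll()  # stands in for probing for and absorbing messages


# Usage: ten sends fill the window; one acknowledgement reopens it.
fc = FlowControl([1], window=10)
for _ in range(10):
    fc.on_send(1)
blocked = not fc.can_send(1)
fc.on_ack(1)
reopened = fc.can_send(1)

polls = []
explore(range(10), lambda: polls.append(1))
print(blocked, reopened, len(polls))
```

The polling interval trades probe overhead against how quickly acknowledgements are returned to stalled senders.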

6 Experiments
These experiments are designed to reveal how well our automated state-to-processor mapping performs relative to our previous hand-tuned hash function, and to determine how various aspects of
the new method contribute to the overall performance. The experiments were conducted on the
IBM SP-2 at NASA Langley Research Center. The SP-2 is a shared multiprocessor; to run a job, one submits
a request for a dedicated subset of nodes of a specified size. The processors of the subset are
dedicated to the application while it runs, but application communication traffic may contend with
that of other applications.
For the purposes of comparison with our earlier approach, we study a model of a Flexible
Manufacturing System, illustrated in Figure 6 and originally discussed in [3]. A key parameter of this
model is the number of tokens k initially placed in places P1, P2, and P3; increasingly larger state-spaces are generated by increasing k. This Petri net has "timed" transitions (white boxes) and
"immediate" transitions (black boxes). The state-space generator eliminates all vanishing markings,
those with one or more immediate transitions enabled. When generating children caused by a timed
transition firing, one or more immediate transitions may be enabled, and these will be fired (and
any further immediate transitions then enabled will be fired, and so on) until all non-vanishing
descendants are discovered.
The performance we report is speedup, relative to a specifically serial version. For k = 5,
there are 152,712 states generated, with 1,111,483 edges between them in the state-space. A highly
optimized serial code requires 381 seconds of processing time on one SP-2 node.
One point of interest is a comparison between the optimized serial version and the new code
running on one processor. We find that the new code runs at 85% of the speed of the serial code.
In modifying the code to support remapping, we had to change the serial code's techniques for
minimizing malloc overhead. The code is currently unoptimized with respect to dynamic memory
overhead; we suspect this is the principal cause of the difference in performance. Whatever the
cause, an implication is that our 16 and 8 processor runs can achieve speedups of at best 13.6 and
6.8, respectively.
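The speedup ceilings quoted above follow from scaling the processor count by the 85% single-processor efficiency. A short Python check, with a hypothetical function name, makes the arithmetic explicit:

```python
def max_speedup(processors, relative_speed=0.85):
    """Upper bound on speedup when the parallel code, run on one processor,
    achieves `relative_speed` times the speed of the optimized serial code."""
    return processors * relative_speed


print(round(max_speedup(16), 1), round(max_speedup(8), 1))
```

The bound assumes perfect scaling of the parallel code, so measured speedups can only approach it.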
Figure 6: Petri Net Model of a Flexible Manufacturing System. White transitions are timed, black
transitions are immediate. Expressions in places denote the number of tokens in the initial marking,
labels on places and transitions are used for description, and labels on arcs describe the number of
tokens transferred by the firing of the connected transition.
The control classes sampled ranged from 800 to 900 states, around one half of one percent
of the total state-space size. Less than a second was required to sample and combine this set of
states. We also measured locality among states within a class; slightly over 50% of all edges in
the state space are between states in the same class. An analytic explanation for locality is given
in Section 7. The code we run is non-deterministic, as its behavior depends on system snapshots
measured at intervals induced by the interrupt timers, as well as on randomization. We have found
that variance due to randomization is surprisingly small (in cases where no remapping occurs), but
that variance due to different system states when remapping occurs has a more pronounced effect
on performance. The data we present for any set of parameters is based on five runs, reset with the
same random number generator seed. Different means of reporting on those runs will be described
when we look at different experiments.
We wished first to determine how well the automated state mapping works compared with the
previous hash-function approach. Compared with the best of the hash functions we previously investigated, the
new method (with no remapping) is significantly faster (by as much as 20%) than the old method.
Compared with another of the previously studied hash functions (not a strawman; it seemed plausible that it
would be good), the new method is over five times faster on 16 processors. Thus it seems that the
randomized class generation method coupled with a cyclic mapping of classes to processors is every
bit as effective as a smart hash function. But, as the speedup achieved is only 7 (out of 16), there is
clearly room for much improvement (see footnote 2). We hope to recover that potential using dynamic remapping.

[Two panels: "CDF of Class Size (in States), Sizes 1-100" and "CDF of Class Size (in States)"; x-axis: # States in Class; y-axis: Fraction of classes with smaller size.]
Figure 7: Cumulative distribution function of class size
We are interested in the characteristics of classes formed by randomization. Figure 7 plots the
cumulative distribution function of class size (in numbers of states). There are 789 non-empty
classes overall. For clarity, we present the data in two sections, the first for class sizes 1-100, the
second for all sizes. We see significant variation: 70% of classes have sizes smaller than 100 states,
and 10% of them have sizes greater than 1000.
The dynamic remapping experiments involve two notions of load. "Active" load measures the
number of unexplored states, which represent future execution workload. "Memory" load measures
the number of all extant states. We balance active load in an attempt to minimize execution
time at the possible expense of memory balance. We balance memory load in an attempt to
evenly balance memory utilization at the possible expense of computational balance. Figure 8
presents the speedup curves for both load definitions, as a function of the remapping "period" (the time
between remappings), expressed as a percentage of the execution time without remapping. Hence,
the rightmost endpoints represent the case with no remapping. In real time units, the remapping
periods sampled were 1, 3, 5, 9, 11, 13, 21, 31, and ∞ seconds. The error bars illustrate the highest
and lowest speedups measured among all 5 samples; the lines pass through the averages. The span
of the error bars illustrates the variability between runs with identical parameters. First we look at
performance when balancing is performed on the memory load. We see that

• Performance at the highest remapping frequency is worse than at less aggressive remapping
frequencies. This is due to shorter periods of execution time over which the remapping
overhead is amortized, and may also be due to whatever synchronization "skew" is introduced by
synchronizing on distributed real-time clocks.

• There is a significant range of frequencies over which performance is relatively constant.

• For 16 processors, performance drops off when there are fewer than 3 remappings.

• For 16 processors, there is a significant performance benefit to remapping.

• Performance for 8 processors is relatively insensitive to the remapping period, and there is
a small benefit to remapping.

• For both 8 and 16 processors, attainable speedup is relatively close to the maximum possible
(6.8 and 13.6).

[Two panels: "Speedup, Memory Load, 8 and 16 Processors" and "Speedup, Active Load, 8 and 16 Processors"; x-axis: Remapping Period (% unmapped execution time); y-axis: Speedup.]
Figure 8: Speedups

Footnote 2: The speedup of the implementation based on a hash function is somewhat smaller than reported in our previous
work, owing to new optimizations in the serial version that significantly reduced its cost.

For performance when balancing active load we draw most of the same conclusions, although the
range of frequencies over which performance is relatively constant is smaller. Also, interestingly, we
see that there is little to distinguish between balancing memory or active load; the latter is slightly
faster.
Another view of the benefit of remapping is obtained by considering how computational demands
vary as the computation progresses. A first approximation to a processor's computational workload
is the number of states it explores between remappings. "Utilization" is then the ratio of the average
number of explored states on a processor between remappings to the maximum number explored
in that period. This is admittedly inexact, as it does not account for the time a processor spends
searching for and inserting remotely generated states into its data structures. Figure 9 illustrates
how utilization varies for the two load definitions, for the original hashing method (on the best hash
function), and for our class-based method with remapping disabled, with sampling every 3 seconds.
Here we clearly see that remapping keeps utilizations high, as long as there is ample workload.
Utilizations tail off as the final pieces of the state-space graph are found. By contrast, utilizations
in the static methods degrade gradually, as processors exhaust their unexplored state lists and must
passively await receipt of an unexplored state from another processor before generating further
states. It is interesting to note that the class-based static method tracks the dynamic methods
early in the computation, but then falls away quickly.

[Plot "Processor Utilization": x-axis: Real Time (seconds); y-axis: Utilization; curves: Active Load, Memory Load, Hash, NoRemap.]
Figure 9: Processor utilizations, 16 processors, two remapping methods and the original hashing
method.
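The utilization measure used in this discussion is simple enough to state as code. The following Python sketch uses a hypothetical function name and invented counts; treating an interval in which no processor explores any state as zero utilization is an assumption, since the text does not cover that case.

```python
def utilization(explored_counts):
    """Ratio of the mean to the maximum number of states explored per
    processor during one interval between remappings."""
    peak = max(explored_counts)
    if peak == 0:
        return 0.0  # assumption: an entirely idle interval counts as zero
    mean = sum(explored_counts) / len(explored_counts)
    return mean / peak


# Invented per-processor counts for one sampling interval.
print(utilization([100, 100, 100, 100]))
print(utilization([100, 50, 50, 0]))
```

A value of 1 means every processor explored as many states as the busiest one; lower values indicate that some processors were starved of work.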
A clearer difference between balancing active and memory load is seen when we examine variability in memory usage. Since balancing memory load gives us greater control over memory balance,
we expect its memory utilization to be better. Memory balance varies as the computation progresses. While the terminal memory balance is most important in some contexts, in others we are
concerned with dynamics. For instance, when the problem is very large relative to the number of
processors, one or more processors can reach their memory limits and so degrade performance
until remapping establishes a balance that frees them to continue processing. Even before this,
excessive memory use can force a processor's virtual memory system to thrash. The better that
balance can be maintained while processing, the smaller the risk of pushing a processor against its
memory limits.
Our implementation saves the number of states in a processor at every remapping timer interrupt. For each such instant we describe the global memory state with the unitless ratio
$(s_{max} - s_{min})/s_{avg}$, where $s_{max}$ is the maximum number of states in any processor, $s_{min}$ is the
minimum, and $s_{avg}$ is the average. For each time instant we plot the average of this statistic taken
over 5 runs. Figure 10 displays this data, comparing balancing memory and active loads for remapping intervals of 1, 3, and 5 seconds. Data for both 8 and 16 processor runs are given. These plots
show a clear distinction between the two notions of load. With a remapping interval of one second
the two methods execute at the same rate, yet the memory balance is very noticeably different.
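The memory-balance statistic can be computed directly from the per-processor state counts. The Python sketch below uses a hypothetical function name and invented counts:

```python
def memory_variation(states_per_processor):
    """The unitless ratio (s_max - s_min) / s_avg for one time instant."""
    s_max = max(states_per_processor)
    s_min = min(states_per_processor)
    s_avg = sum(states_per_processor) / len(states_per_processor)
    return (s_max - s_min) / s_avg


# Invented state counts for illustration.
print(memory_variation([100, 100, 100, 100]))
print(memory_variation([150, 50]))
```

A value of 0 indicates perfect balance; the ratio grows as the gap between the fullest and emptiest processors widens relative to the average.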
[Six panels: Remapping Interval = 1, 3, and 5 sec, each for 16 Processors and for 8 Processors; x-axis: Real Time (seconds); y-axis: Variation; curves: Active Load, Memory Load.]
Figure 10: Variation in memory utilization, as a function of time. Variation is measured as the average
ratio of the difference between maximum and minimum memory usage to average memory usage.
It is notable that in all plots the variation starts high and is corrected, but then tends to grow.
When balancing memory load, the growth is due to the built-in constraint against spending too much
time remapping. When balancing active load, it is due to the fact that some imbalance simply cannot be
corrected: any class without unexplored states is ineligible for remapping. The data also suggest
that the methods differ less as the remapping frequency decreases.
[Two panels: "Total Remapping Overhead" (y-axis: Fraction of Execution Time) and "Total Classes Remapped" (y-axis: Total # Classes); x-axis: Remapping Period (% overall execution time); curves: Active Load, Memory Load.]
Figure 11: Fraction of execution time spent remapping, and total number of classes remapped. The
remapping period is expressed as a percentage of the overall execution time.
A final view of remapping behavior is obtained by looking at statistics gathered at remapping
epochs. We measure the total amount of time spent synchronizing for, computing, and conducting
remapping; we also measure the total number of classes remapped (counting each move a class
makes individually).
Figure 11 plots this data for the two load definitions, as a function of the remapping period,
expressed as a percentage of total execution time. We see that under the highest frequency of
remapping, balancing unexplored states exacts a very high overhead, 20% of the total execution
time (our check against spending more than 10% of the time in remapping overhead is both inexact and
applied only to transfer costs). We also see very clearly that (surprisingly) balancing active load
tends to have higher overhead, while balancing memory load moves far more classes! Two factors
may explain these observations. First, this data says nothing about the sizes of the classes being
moved (and we did not measure this). It may be that classes moved when balancing unexplored
states tend to be much larger. A second factor is the cost of computing which states will be moved.
The heart of the mapping algorithm searches through classes looking for the largest movable ones
that satisfy certain constraints. Since the set of classes with unexplored states is smaller, it may
be that much more searching is required. Both factors are mere conjectures.

7 Analysis
In this section we provide a simple analysis that explains locality between states in a common class,
and that looks at the scalability of remapping in the global fashion we have used.
When a state is classified by searching for it in the control class, some random number C of its
components will determine how the state compares with states in the control set. C need not be
the depth of the tree; for example the first component tested may indicate that the state is less
than the tree root, and the same component may indicate that it is greater than the tree's left child.
So, if we choose a state s from some class at random, C describes the number of its components
that were involved in classifying the state. Now consider any child of s. The child differs from s in
some random number of components D. If that set of components is entirely disjoint from the set
of components used to classify the parent, then the child and parent are assured to be in the same
class. Note that the reverse is not true, that is, states might differ in some (or even all) of the C
components and still be in the same class, as previously observed for class 1 in Figure 2.
The set of components used to classify s and the set of components in which parent and child
differ are independent, since the components used in the lexicographical ordering for comparison
are randomized. If the state vector has n components, the probability $p_s$ that the child is in the
same class as the parent is at least

\[
p_s \;\ge\; E\left[\prod_{i=0}^{D-1} \frac{n - C - i}{n}\right]
\;>\; E\left[\left(\frac{n - C}{n}\right)^{D}\right].
\]

These expressions come from thinking of the sequence of selecting D components that distinguish
parent and child, without replacement from the vector of n components, each selection needing to
avoid the C components that classify the parent. Further bounding is possible by recognizing that
$(n - c)/n$ is convex in c, so that by closure $((n - c)/n)^d$ is convex in c for any $d \ge 1$. We may write

\[
p_s \;>\; E\left[\left(\frac{n - C}{n}\right)^{D}\right]
\;=\; \sum_{d=1}^{n} \Pr\{D = d\}\, E\left[\left(\frac{n - C}{n}\right)^{d}\right].
\]

Jensen's Inequality [22] states that if g(x) is convex, then for any random variable X where E[X]
exists, $E[g(X)] \ge g(E[X])$; thus for every d, $E\left[((n - C)/n)^d\right] \ge ((n - E[C])/n)^d$. Therefore

\[
p_s \;>\; \sum_{d=1}^{n} \Pr\{D = d\} \left(\frac{n - E[C]}{n}\right)^{d}
\;=\; E\left[\left(\frac{n - E[C]}{n}\right)^{D}\right]
\;\ge\; \left(\frac{n - E[C]}{n}\right)^{E[D]},
\]

where the last step follows from another application of Jensen's Inequality. This last expression
gives us a way of considering how problem characteristics n, E[C], and E[D] affect locality. For
the specific problem used as illustration in this paper, n = 22 and we instrumented the program
to estimate E[C] = 4.2 and E[D] = 5.1. This yields $p_s > 0.34$, as compared with the measured
$p_s \approx 0.51$. The bound evidently is not tight, but nevertheless is useful in describing the mechanics
of locality and bounding locality from below.
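The final bound is easy to evaluate numerically. The Python sketch below, with a hypothetical function name, plugs in the values quoted above (n = 22, E[C] = 4.2, E[D] = 5.1):

```python
def locality_lower_bound(n, mean_c, mean_d):
    """Evaluate ((n - E[C]) / n) ** E[D], the lower bound on p_s."""
    return ((n - mean_c) / n) ** mean_d


bound = locality_lower_bound(22, 4.2, 5.1)
print(round(bound, 2))
```

The function also makes it easy to see how the bound weakens as more components take part in classification (larger E[C]) or as children differ from parents in more components (larger E[D]).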

We now turn to a simple analysis of the costs of computing a global view of load. Like many
before us, we model the cost of communicating one message as a constant startup cost plus
a length-dependent transmission cost. The reductions we employ involve log P stages of message passing (P is the number of processors), each stage separated by computation that performs
the vector-length reduction locally. The cost of computing the global reduction is modeled as
$(\log P)(\alpha + (\beta + \gamma)P)$, where $\alpha$ is the startup cost, $\beta$ the per-integer transmission cost, and $\gamma$ the
per-integer computation cost. We view these constants in units of machine cycles. $\alpha$ includes
all MPI overhead and messaging protocol logic; a reasonable order of magnitude is $\alpha = 10^4$. On
modern high-bandwidth networks $\beta$ is small, say, $\beta = 1$. Computation costs reflected by $\gamma$ include
copying into and out of memory, and comparison; $\gamma = 10$ is reasonable.
The cost of computing a global snapshot increases as a function of P log P; we are interested in
determining for what magnitude of P this cost is excessive. We do this by comparing the length-independent and length-dependent costs. For a given fraction f we can compute the value of P for
which the length-independent cost is a fraction f of the total cost:

\[
f = \frac{\alpha \log P}{(\log P)(\alpha + (\beta + \gamma)P)}
  = \frac{1}{1 + P\,\frac{\beta + \gamma}{\alpha}}.
\]

Solving for P we obtain

\[
P = \left(\frac{1}{f} - 1\right) \frac{\alpha}{\beta + \gamma}. \tag{1}
\]

To assess the relative cost of computing a global viewpoint, we ask ourselves what degree of dependency on length we can tolerate. This consideration should include the amortization of the
remapping cost over computational periods between remaps. For instance, if an order of magnitude difference is acceptable we would use f = 0.1. Using the estimates of $\alpha$, $\beta$, and $\gamma$ given earlier,
on the order of $10^4$ processors could be tolerated with this approach. Considering today's technology, there is clearly room for a reduction of two orders of magnitude in $\alpha$ while still having computation of
a global viewpoint make sense. We conclude from this analysis that remapping based on a global
view should scale to machines available in the foreseeable future, and when tens of thousands of
processors are routinely used we can reassess the merits of the approach.
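The cost model and Equation (1) can be checked with a few lines of Python. The function and parameter names here are hypothetical, the defaults are the order-of-magnitude estimates given above, and the base-2 logarithm is an assumption (the model only says "log P stages").

```python
import math


def reduction_cost(p, startup=1e4, transmit=1.0, compute=10.0):
    """Modeled cost, in machine cycles, of one global vector reduction."""
    return math.log2(p) * (startup + (transmit + compute) * p)


def startup_fraction(p, startup=1e4, transmit=1.0, compute=10.0):
    """Fraction f of the reduction cost that does not depend on vector length."""
    return 1.0 / (1.0 + p * (transmit + compute) / startup)


def processors_for_fraction(f, startup=1e4, transmit=1.0, compute=10.0):
    """Equation (1): the processor count at which the startup fraction is f."""
    return (1.0 / f - 1.0) * startup / (transmit + compute)


p = processors_for_fraction(0.1)
print(round(p))                      # processors tolerated at f = 0.1
print(round(startup_fraction(p), 3)) # recovers f
# With a startup cost two orders of magnitude smaller:
print(round(processors_for_fraction(0.1, startup=1e2)))
```

The logarithmic factor cancels in the ratio, which is why the tolerable processor count depends only on f and on the ratio of startup cost to per-integer cost.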

8 Conclusions
The generation of large discrete state-spaces is a computationally intensive activity with extreme
memory demands, highly irregular behavior, and poor locality of reference. As the generation of
large state-spaces causes virtual memory thrashing on single-processor systems, we are led to
consider exploiting the larger memory space available in parallel or distributed systems.
State-space generation is essentially a breadth-first generation and traversal of the reachability
graph that is implicit in the model semantics. The critical problem is determining whether a newly
generated state has been generated before. In a serial implementation this question is answered
by organizing known states in a search tree, and looking for the new state in that tree. As this is
a centralized activity, any parallel or distributed solution must find an alternative approach. The
method we pursue is to assign states to processors. After a state has been generated, it is sent to
its assigned location, where a local search tree determines whether the state already exists. Our
solution entails automated classification of states into classes, and dynamic mapping and remapping
of classes to processors.
Our state classification system is based on randomization: states visited in a set of initial random
walks are organized in a search tree, using a randomized lexicographical comparison function. All
states that exit the search at the same point are in the same class. We provide an analysis of the
method that lower bounds the probability that a child is in the same class as its parent; the bound
is expressed in terms of easily understood model characteristics.
Our remapping scheme uses a real-time clock to coordinate all processors periodically. When
processors coordinate, they use a global vector OR reduction to distribute a snapshot of the global
load state. Each processor uses the snapshot to determine whether to send load, receive load, or
do nothing. Further vector reductions establish the number and identity of the classes involved in
the balancing. The processors involved in the exchange implement it, and all processors modify the
global class-to-processor mapping table. We analyze the cost of a global snapshot to argue that it
scales to system sizes much larger than those we anticipate would be commonly used for years to
come.
We report on data collected on the solution of a system model discussed earlier in the literature.
We considered two variations that differ in the notion of "load". One method balances extant
classes without explicit concern for balancing future state generation workload. The other method
balances states that have yet to be explored, without explicit concern for balancing existing states.
We find that both methods execute significantly faster and achieve better memory balance than
static methods. However, while we find that the method that balances memory does indeed achieve
better memory balance, it runs almost as fast as the method that balances future workload. For both
methods, remapping too often (e.g. every second) degrades performance due to excessive overhead;
backing off somewhat amortizes remapping overhead but stays responsive to load variation. Our
data suggests that remapping frequently and suffering greater remapping overhead is to be preferred
to letting the computation get widely out of balance and then attempting to correct it. To the extent
that we can generalize from our experiments, it would seem that we should remap as frequently as
possible, relative to the speed of the communication medium. We expect that a longer remapping
period is needed on an Ethernet-connected network of workstations, for instance, to avoid undue
communication overhead.
The fundamental message is that for distributed or parallel discrete state space generation,
remapping is essential. We have demonstrated techniques that prove its feasibility and demonstrate its potential. Key elements of our success were automated and efficient state classification,
and low-overhead methods for computing and using global snapshots of load state to compute
remappings. Access constraints to the SP-2 forced us to focus on a relatively small model instance
when conducting our parametric study. In fact, pilot runs on larger models demonstrate even more
significant performance gains due to remapping; in one case we observed a factor of 5 difference
when using 32 processors. The point is clear: remapping is essential, and it is feasible.

Acknowledgements
The code we worked with has been modified several times, by several people. The original distributed state-space generation code was authored largely by Josh Gluckman. Jeremy Gottlieb
ported this code to use MPI on the IBM SP-2. Barry Lawson implemented most of the class-based
data structures and book-keeping. To these tireless workers we offer our thanks.

References
[1] G.E. Blelloch. Scans as primitive parallel operations. IEEE Trans. on Computers, 38(11):1526{
1538, November 1989.
[2] G. Ciardo, J. Gluckman, and D. Nicol. Distributed state-space generation of discrete-state
stochastic models. INFORMS Journal on Computing, 1997. To appear.
[3] G. Ciardo and K. Trivedi. A decomposition approach for stochastic reward net models. Performance Evaluation, 18(1):37{59, 1993.
[4] G. Ciardo, K. Trivedi, and J. Muppala. SPNP: Stochastic Petri net package. In Proceedings of
the 3rd Int. Workshop on Petri Nets and Performance Models, pages 142{151, Kyoto, Japan,
December 1989. IEEE Press.
[5] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. Journal of
Parallel and Distributed Computing, 7(2):279{301, October 1989.
[6] D.L. Eager, E.D. Lazowska, and J. Zahorjan. Adaptive load sharing in homogeneous distributed systems. IEEE Trans. on Software Engineering, SE-12(5):662{675, 1986.
[7] D. Gerogiannis and S. Orphanoudakis. Load balancing requirements in parallel implementations of image feature extraction tasks. IEEE Trans. on Parallel and Distributed Systems,
4(9):994{1013, 1993.
[8] Gropp, Lusk, and Skjellum. Using MPI. MIT Press, Cambridge, Mass, 1994.
[9] E. Horowitz and S. Sahni. Fundamentals of Data Structures. Computer Science Press, Potomac, Maryland, 1978.
23

[10] L. Kale. Comparing the performance of two dynamic load distribution methods. In Proceedings
of the 1988 Int. Conference on Parallel Processing, pages 8{12, 1988.
[11] R. Karp and Y. Zhang. A randomized parallel branch-and-bound procedure. In Proceedings
of the 20th Annual ACM Symp. on Theory of Computing, pages 290{300, 1988.
[12] V. Kumar, A. Grama, and N. Vempaty. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22:60{79, 1994.
[13] F. Lin and R. Keller. The gradient model load balancing method. IEEE Trans. on Software
Engineering, SE-13(1), January 1987.
[14] R. Lüling and B. Monien. Load balancing for distributed branch-and-bound. In Proceedings
of 6th Int. Parallel Processing Symposium, pages 543-548, 1992.
[15] D. Nicol. Communication efficient global load balancing. In Proceedings of the 1992 Scalable
High Performance Computing Conference. IEEE Press, April 1992.
[16] D. Nicol. Non-committal barrier synchronization. Parallel Computing, 21:529-549, 1995.
[17] D. Nicol, R. Simha, and D. Towsley. Static assignment of stochastic tasks using majorization.
IEEE Trans. on Computers, 45(6):730-740, 1996.
[18] D.M. Nicol and P.F. Reynolds, Jr. Optimal dynamic remapping of data parallel computations.
IEEE Trans. on Computers, 39(2):206-219, February 1990.
[19] D.M. Nicol and J.H. Saltz. Dynamic remapping of parallel computations with varying resource
demands. IEEE Trans. on Computers, 37(9):1073-1087, September 1988.
[20] D.M. Nicol, J.H. Saltz, and J. Townsend. Delay point schedules for irregular parallel computations. Int. Journal of Parallel Programming, 18(1):69-90, February 1989.
[21] N. Rao and V. Kumar. Parallel depth-first-search, part I: Implementation. Int. Journal of
Parallel Programming, 16(6):479-499, 1987.
[22] S.M. Ross. Stochastic Processes. Wiley, New York, 1983.
[23] J.A. Stankovic. An application of Bayesian decision theory to decentralized control of job
scheduling. IEEE Trans. on Computers, C-34(2):117-130, February 1985.
[24] M. Willebeek-LeMair and A. Reeves. Strategies for dynamic load balancing on highly parallel
computers. IEEE Trans. on Parallel and Distributed Systems, 4(9):979-993, 1993.
[25] C.-Z. Xu and F. Lau. Analysis of the generalized dimension exchange method for dynamic
load balancing. Journal of Parallel and Distributed Computing, 16(4):385-393, 1992.
[26] C.-Z. Xu and F. Lau. Iterative dynamic load balancing in multicomputers. Journal of the Operational Research Society, 45(7):786-796, July 1994.


