Static Scheduling for Barrier MIMD Architectures by Zaafrani, Abderrazek et al.
Purdue University
Purdue e-Pubs
Department of Electrical and Computer
Engineering Technical Reports
Department of Electrical and Computer
Engineering
1-1-1990







Zaafrani, Abderrazek; Dietz, Henry G.; and O'Keefe, Matthew T., "Static Scheduling for Barrier MIMD Architectures" (1990).
Department of Electrical and Computer Engineering Technical Reports. Paper 702.
https://docs.lib.purdue.edu/ecetr/702
Static Scheduling for Barrier MIMD Architectures
Abderrazek Zaafrani, Henry G. Dietz, and Matthew T. O'Keefe
School of Electrical Engineering
Purdue University
West Lafayette, Indiana 47907




Barrier MIMDs are asynchronous Multiple Instruction stream Multiple Data
stream architectures capable of parallel execution of variable-execution-time instructions
and arbitrary control flow (e.g., while loops and calls); however, they differ
from conventional MIMDs in that the need for run-time synchronization is significantly
reduced. Whenever a group of processors within a barrier MIMD encounters a synchronization
point (barrier), static timing constraints become precise; hence, conceptual
synchronizations between the processors often can be statically resolved with zero cost,
as in a SIMD or VLIW and using similar compiler technology. Unlike these
machines, however, as execution continues past the synchronization point the accuracy
within which the compiler can track the relative timing between processors is reduced.
Where this imprecision becomes too large, the compiler simply inserts a synchronization
barrier to ensure that timing imprecision at that point is zero, and again employs
static, implicit synchronization.
This paper describes new scheduling and barrier placement algorithms for barrier
MIMDs that are based loosely on the list scheduling approach employed for VLIWs
[Elli85]. In addition, the experimental results from scheduling more than 3500 synthetic
benchmark programs for a parameterized barrier MIMD machine are presented.
Keywords: Static Barrier MIMD (SBM), Dynamic Barrier MIMD (DBM), barrier synchronization,
code scheduling, compiler optimization.
1. Introduction
Runtime synchronization overhead is a critical factor in achieving high speedup using parallel computers.
A key advantage of SIMD (Single Instruction stream, Multiple Data stream) architectures is that
synchronization is effected statically at compile-time; hence, the execution-time cost of synchronization
between “processes” is essentially zero. VLIW (Very Long Instruction Word) [Elli85], [CNOPR88]
machines are successful in large part because they preserve this property while providing more flexibility
in terms of the operations that can be parallelized. Unfortunately, VLIWs cannot tolerate any asynchrony
in their operation; hence, they are incapable of parallel execution of multiple flow-paths, subprogram
calls, and variable-execution-time instructions. In a recent paper [DSOZ89], a new architecture was proposed
that extended the static synchronization properties of the SIMD and VLIW class of parallel
machines into the MIMD domain. The new architecture is called the Static Barrier MIMD, or SBM.
In this paper, we describe scheduling and barrier placement algorithms for barrier MIMDs, as well 
as extensive results from scheduling synthetic benchmarks using the new algorithms. An SBM is a 
MIMD computer which has specialized hardware implementing a new type of barrier synchronization 
which allows the compiler to perform static scheduling. If an SBM barrier is placed across a set of 
processes, then no process can execute past that barrier until all have reached the barrier. Unlike other 
barrier mechanisms, all processes will resume execution in exact synchrony¹. Hence, immediately after
executing an SBM barrier, the machine can be treated as a VLIW, using static scheduling to eliminate the 
need for further runtime synchronization.
However, VLIW machines do not allow MIMD code structures (e.g., multiple flow paths) nor even
variable-time instructions. In an SBM, static scheduling tracks both minimum and maximum completion
times for each process's code; runtime synchronization is needed iff the minimum time for the consumer
of an object is less than the maximum time for that object’s producer. If the timing constraints cannot be
met statically, this implies that the static timing information has become too “fuzzy.” Inserting another
barrier effectively reduces this fuzziness to zero. Based on the static scheduling work described in this
paper, more than 77% of all synchronizations which would occur in execution on a conventional MIMD
will be accomplished without runtime synchronization in a barrier MIMD.
When barrier MIMD architectures were originally proposed [DiSc88], [DSOZ89], an algorithm for 
inserting barriers while scheduling code was given. The algorithm attempted to minimize the number of 
barriers required by using the static timing constraints inherent in the barrier synchronization operation. 
In these previous papers, no implementation of the barrier insertion algorithm was attempted nor was 
there any algorithm proposed for the actual code scheduling [DSOZ89]. This work contains three new 
results: a code scheduling algorithm for barrier MIMDs, an “optimal” barrier insertion algorithm, and
extensive scheduling experiments on synthetic benchmarks using the new algorithms.
1. Machines with this constraint will be called barrier MIMDs in this paper.
The paper is organized as follows. Section two describes the structure of the synthetic benchmark 
programs, while section three explains the principles of operation for a barrier MIMD. Section four gives 
the scheduling and barrier insertion algorithms, and is followed by the description of the scheduling 
experiments in section five. Section six provides a comparison between VLIW and SBM performance for 
the synthetic benchmarks used in this paper. Finally, section seven gives conclusions and describes 
current research efforts.
2. Structure of the Synthetic Benchmark Programs
This study focuses on fine-grain scheduling of a single-chip multiprocessor RISC node [DSC089] 
that employs the barrier mechanism discussed in this paper. Expensive operations such as multiplication 
and division are implemented as data-dependent code sequences that introduce asynchrony into the chip 
operation. Memory accesses across a shared bus or interconnection network involve contention that also 
involves stochastic delays. It is shown that static scheduling may still be used to advantage within this 
framework.
In this work, we wished to characterize and study the extent to which static scheduling can be 
employed in barrier MIMDs. In particular, measurements of the number of synchronizations that are 
satisfied statically, at compile-time, versus the number that require explicit synchronization instructions 
executed at run-time were desired. To this end, a compiler was developed for a simple language consisting
of basic blocks of code with no control flow constructs.
The programs to be scheduled on barrier machines were automatically generated using common 
instruction execution frequencies [AlWo75]. This allowed us to automatically generate a very large 
number of synthetic benchmarks from which summary statistics were obtained. It also made it quite simple
to change the various characteristics of generated programs to observe the effects on the statistics of
scheduled programs. The drawback, of course, is that it is not possible to take real benchmark programs
directly as input. Current efforts include prototype compiler development for generating code for barrier 
MIMDs from a standard set of programming language constructs, including control structures, array and 
pointer data structures, and subroutine calls [OKee90]. We view the scheduling results for the synthetic 
benchmarks as conservative in terms of comparing the performance of VLIWs and the SBM since it is 
precisely for such programming constructs that the SBM is superior.
2.1. Benchmark Instruction Set
The scheduling algorithm takes as input a basic block of instructions. A basic block is a region of 
code that contains a sequence of consecutive statements. This region should have a single entry point and 
no embedded control structures [AhSU86]. There are nine instructions generated from the synthetic code 
sequences in the instruction set: four of these nine instructions have variable execution time. The 




Instruction   Execution Freq.   Min. Time   Max. Time
Load          --                1           4
Store         --                1           1
Add           45.8%             1           1
Sub           33.9%             1           1
And           8.8%              1           1
Or            5.2%              1           1
Mul           2.9%              16          24
Div           2.2%              24          32
Mod           1.2%              24          32
Table 1: Instruction Frequencies and Execution Time Ranges
Load  Execution time varies from one to four time units. In a shared-bus multiprocessor, this
      difference is mainly due to different access times between local cache and main memory.
      Typically, an access to the main memory is anywhere from four to twenty times longer
      than an access to cache. A more pronounced difference between local and non-local
      memory is often found in multiprocessors that require non-local accesses to go
      through single- or multistage interconnection networks.
Mul   Execution time varies from 16 to 24 units of time. This assumes that the multiply operation
      is either implemented using shift and add instructions, or in an asynchronous hardware
      design. In either case, execution time is variable to take advantage of data-dependent
      optimizations. Synchronous designs with constant execution time are possible, but require
      more hardware, typically in the form of pipelines. Since multiplication is a commonly executed
      instruction, this additional hardware sometimes can be justified.
Div   Execution time varies from 24 to 32 time units, for much the same reason as multiply.
      Asynchronous designs are much more common for division, as the operation is inherently
      harder to pipeline and the additional hardware is not justified by its typically low execution
      frequency.
Mod   The execution time of the modulus operation Mod varies between 24 and 32 time units,
      for the same reasons as division.
For the other operations, it is realistic to assume that they have a constant execution time of one unit.
These operations are Or, And, Add, Sub, and Store. Table 1 summarizes the instruction frequencies
and execution time ranges.
Tuple No.   Instruction     Min. Time   Max. Time
0           Load i          1           4
1           Load a          1           4
2           Add 0,1         2           5
3           Store b,2       3           6
4           Load f          1           4
24          Load d          1           4
5           Load j          1           4
12          Load c          1           4
26          And 4,24        2           5
6           Add 4,5         2           5
30          Sub 26,4        3           6
18          Sub 6,0         3           6
22          Add 1,2         2           5
38          Add 12,30       4           7
19          Store i,18      4           7
23          Store a,22      3           6
27          Store h,26      3           6
31          Store e,30      4           7
39          Store g,38      5           8
Figure 1: Instructions from Example Synthetic Benchmark
2.2. Benchmark Synthesis
A C program was developed to randomly generate the basic blocks according to the statistics 
described below. This program requires as input the number of statements, variables, and constants 
desired in the generated code. It then generates a random sequence of assignment statements satisfying 
the desired conditions. The frequency of the assignment statements corresponds loosely to the instruction 
frequency distributions found in [AlWo75]. Note that in Table 1 the frequencies of load and store are not
given. These instructions are provided as necessary during code generation and optimization: the first 
reference to a variable causes a load for that variable to be generated, and a store is generated when a 
variable is assigned a value.
During code generation, the randomly-generated assignment statements are optimized using standard
local optimizations, including common subexpression elimination, constant folding and value propagation,
and dead code elimination [AhSU86]. Hence, the resulting synthetic benchmark does not contain
“redundant” parallelism that might skew the results.
Figure 2: Instruction DAG for Example Synthetic Benchmark.
An example synthetic benchmark is shown in figure 1, and its corresponding DAG (directed acyclic
graph) is shown in figure 2. In figure 1, the leftmost column represents the tuple number. Each tuple is
incrementally assigned a number as it is generated by the code generator. Many tuple numbers are
missing because the corresponding tuples were removed by the optimizer. The two rightmost columns
represent the minimum finish time and maximum finish time on an infinite number of processors. These
columns help in ordering the tuples, as explained in section 4. In the instruction DAG, the instructions are
represented as nodes and edges represent the precedence constraints between instructions. This DAG is
important to both the code optimizations and scheduling algorithms.
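The two finish-time columns of figure 1 can be reproduced with a short longest-path computation. The following is a minimal sketch, not the authors' compiler: it assumes the time ranges of Table 1, uses one dependence chain of tuples copied from figure 1, and the names `TIME`, `TUPLES`, and `finish_times` are hypothetical.

```python
# Execution time ranges (min, max) per opcode, taken from Table 1.
TIME = {"Load": (1, 4), "Store": (1, 1), "Add": (1, 1), "Sub": (1, 1),
        "And": (1, 1), "Or": (1, 1), "Mul": (16, 24), "Div": (24, 32),
        "Mod": (24, 32)}

# One dependence chain from figure 1: tuple number -> (opcode, producer tuples).
# Variable operands are omitted; only tuple operands create precedence edges.
TUPLES = {
    4: ("Load", []),
    24: ("Load", []),
    12: ("Load", []),
    26: ("And", [4, 24]),
    30: ("Sub", [26, 4]),
    38: ("Add", [12, 30]),
    39: ("Store", [38]),
}

def finish_times(tuples):
    """Return {tuple: (min_finish, max_finish)} assuming unlimited processors."""
    done = {}

    def visit(n):
        if n not in done:
            op, preds = tuples[n]
            lo, hi = TIME[op]
            # A tuple starts only when all of its producers have finished.
            start_lo = max((visit(p)[0] for p in preds), default=0)
            start_hi = max((visit(p)[1] for p in preds), default=0)
            done[n] = (start_lo + lo, start_hi + hi)
        return done[n]

    for n in tuples:
        visit(n)
    return done

for number, (lo, hi) in sorted(finish_times(TUPLES).items()):
    print(number, lo, hi)
```

For this chain the sketch yields the same ranges as the corresponding rows of figure 1, for example 5 and 8 for tuple 39.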
Parameters for both the generated code and the scheduling algorithms can be varied. The machine
size for the scheduling algorithm can be varied from 2 to 128 processing elements. The parameters for the
random sequences of assignment statements include the number of variables and statements. The number
of variables corresponds roughly to the parallelism width of the generated benchmark after optimization.
For a fixed number of processors, the number of variables can be varied from 2 to 15 variables. For a
fixed number of processors and a fixed number of variables, the number of statements can be varied from
5 to 60 statements. The larger basic block sizes approximate a long instruction trace found in VLIW
scheduling [Elli85].
3. Barrier MIMD Principles of Operation
Figures 3 through 8 are used to illustrate the basic scheduling concepts behind barrier MIMD architectures.
Figure 3 depicts the use of conventional directed synchronization to ensure that the producer
executes before the consumer. Here, at run-time, a synchronization object is transmitted from the
producer to the consumer. This transmission could take a potentially unbounded amount of time dependent
on routing and traffic through a network; hence, the only timing information available at compile-time ...





Figure 3 Figure 4
Instead, suppose that compiler analysis attempts to precisely track the minimum and maximum 
times at which the producer and consumer would execute without using any runtime synchronization. As 
shown in figure 4, if the minimum consumer time is greater than the maximum producer time, then no 
runtime synchronization is required. If not, as in figure 5, then it is necessary to insert a barrier to impose 
timing constraints which will be known to satisfy the producer/consumer relationship. This is shown in 
figure 6.
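The test illustrated by figures 4 and 5 can be sketched as follows. This is only an illustration under stated assumptions: each processor's code since the last common barrier is modeled as a list of (min, max) instruction times, and the function names and example code sequences are hypothetical.

```python
def max_finish(code, index):
    """Latest time at which instruction `index` can finish."""
    return sum(hi for _, hi in code[: index + 1])

def min_start(code, index):
    """Earliest time at which instruction `index` can start."""
    return sum(lo for lo, _ in code[:index])

def barrier_needed(prod_code, prod_idx, cons_code, cons_idx):
    """A barrier is needed unless the consumer's minimum start time is
    greater than the producer's maximum finish time."""
    return not (min_start(cons_code, cons_idx) > max_finish(prod_code, prod_idx))

# Producer: a Load (1..4) followed by the producing Add (1..1).
producer = [(1, 4), (1, 1)]
# Consumer preceded by a Mul (16..24): it cannot start before time 16.
slow_consumer = [(16, 24), (1, 1)]
# Consumer preceded by a Load (1..4): it may start as early as time 1.
fast_consumer = [(1, 4), (1, 1)]

print(barrier_needed(producer, 1, slow_consumer, 1))  # no barrier required
print(barrier_needed(producer, 1, fast_consumer, 1))  # barrier required
```

In the first case the timing ranges do not overlap, as in figure 4; in the second they do, as in figure 5.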
Figure 5 Figure 6
Shaffer [Shaf89] and others have applied a transitive reduction [AhHU74] to task graphs to remove 
redundant synchronization in code executing on MIMD architectures. Callahan [Call87] proposed a similar
method for reducing the number of (conventional) barrier synchronizations required in scheduling
nested loop constructs. However, these techniques will only remove synchronizations based on graph 
structure, rather than on knowledge of minimum and maximum execution time bounds as we propose. In 
addition to removing synchronizations based on the structure of the task graph, we can safely remove 















Figure 7 Figure 8
There is also a secondary effect unique to our scheduling for the proposed barrier architectures. 
Typically, there will be a sequence of several producers and consumers in the code being analyzed and 
scheduled, as shown in figure 7. Although at first it seems that each such producer/consumer pair will 
require a runtime synchronization (hence, two barriers for figure 7), the insertion of a barrier satisfying 
the first producer/consumer constraint causes the timing of later producer/consumer pairs to be more precisely
known. This often (about 28% of the time in our current studies) allows the compiler to avoid
inserting further barriers, as shown in figure 8.
3.1. Models for Barrier Synchronization
We now introduce representations for barrier synchronization in concurrent processors. These 
representations will help in understanding barrier MIMD execution and design alternatives. In this work, 
the barrier embedding for a set of concurrent processors will be represented as in figure 9. The vertical 
lines represent concurrently executing processors while the horizontal lines represent barriers across the 
processors they intersect. The semantics of these barriers are that the participating processors cannot 
proceed until all have arrived at the barrier, e.g., in figure 9, processors P0, P1, ..., P4 cannot proceed past
barrier 0 until all have arrived there. At that point, they all start execution of the instruction following the
barrier simultaneously. Process execution proceeds in the downward direction.
Several concepts and results from the theory of partially ordered sets are useful in understanding
barrier embeddings within concurrent processors. Recall that a binary relation $R$ on a set $X$ is a subset of
the Cartesian product $X^2$, that is, $R \subseteq X \times X$. Let $xRy$ correspond to $(x, y) \in R$, and
not($xRy$) represent $(x, y) \notin R$. The binary relation $<_b$ on a set of barriers $B$ is a partial ordering
because $<_b$ is both irreflexive and transitive² [Fish85]. The partially ordered set $(B, <_b)$ may be illustrated
by a directed acyclic graph (dag), with the graph nodes representing barriers and edges representing the
ordering relations $<_b$ among the barriers. The initial barrier is defined as the barrier that extends across
all processors and precedes all other barriers. A barrier dag for the barrier embedding in figure 9 is shown
in figure 10. The initial barrier for this dag is $b_0$ (barrier 0). Here we see that $b_2$ (barrier 2) must execute
before $b_3$ (barrier 3), hence $b_2 <_b b_3$, and similarly $b_2 <_b b_4$. Transitivity extends this ordering to any
barrier that follows $b_3$ or $b_4$. These properties are derived from the barrier semantics: barrier $b_3$ must be
executed after the process P3 has encountered barrier $b_2$. Similarly, $b_4$ must be executed after the
process P2 has encountered $b_2$.
2. A binary relation $R$ on $X$ is irreflexive if not $xRx$ for every $x$ in $X$. It is transitive if
$(xRy, yRz) \Rightarrow xRz$ for all $x, y, z$ in $X$.
Figure 9 Figure 10
Several terms that will be used frequently in this paper are now defined.
Total Implied Synchronizations:
The number of edges in the directed acyclic graph (DAG) corresponding to the code generated from
a basic block. Each edge is considered to be a producer/consumer synchronization pair.
Barrier Synchronization Fraction:
The number of barriers in the schedule divided by the Total Implied Synchronizations.
Serialized Synchronization Fraction:
The number of synchronizations satisfied by serialization, i.e., consumer assigned to the same processor
as a producer, divided by the Total Implied Synchronizations.
Static Scheduling Fraction:
The remaining fraction of total implied synchronizations after the barrier and serialized fractions are
removed. This represents the synchronizations that are scheduled away by tracking static timing
constraints after a barrier executes, as in the second producer/consumer synchronization of figure 8.
In this case, no explicit synchronization instruction need be generated.
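The three fractions defined above always sum to one, which the following sketch makes explicit. The function name, the small DAG, and the processor assignment are hypothetical values chosen only for illustration.

```python
def synchronization_fractions(edges, processor_of, num_barriers):
    """Return (barrier, serialized, static) fractions for a schedule.

    edges        -- (producer, consumer) pairs of the instruction DAG
    processor_of -- node -> processor the node was assigned to
    num_barriers -- number of barriers inserted in the schedule
    """
    total = len(edges)  # Total Implied Synchronizations
    serialized = sum(1 for p, c in edges if processor_of[p] == processor_of[c])
    barrier_fraction = num_barriers / total
    serialized_fraction = serialized / total
    # Whatever is neither a barrier nor serialized was resolved statically.
    static_fraction = 1.0 - barrier_fraction - serialized_fraction
    return barrier_fraction, serialized_fraction, static_fraction

# Hypothetical schedule: four DAG edges, two processors, one barrier.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("b", "e")]
processor_of = {"a": 0, "b": 1, "c": 0, "d": 0, "e": 0}
print(synchronization_fractions(edges, processor_of, num_barriers=1))
```

Here two of the four edges are serialized, one is covered by the barrier, and the remaining one counts toward the static scheduling fraction.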
3.2. Barrier MIMD Hardware
Two forms of barrier MIMD are discussed in this paper: static barrier MIMD (SBM) and dynamic 
barrier MIMD (DBM). The difference between the two lies in the run-time ordering of the barriers: the 
SBM imposes an ordering at compile-time, and barriers may be delayed if this compile-time (static) ordering
differs from the run-time ordering. The DBM executes the barriers in whatever run-time order they
occur. It requires an associative matching memory to achieve this, whereas the SBM requires only a 
hardware queue [OKDi90].
Although, intuitively, a synchronization operation which can span an arbitrary set of processes 
appears to imply high overhead, we believe that the new barriers can be implemented with lower overhead
than conventional binary producer/consumer synchronization. In fact, the PASM prototype
[ScNa87] has hardware capable of implementing SBM execution and preliminary benchmarks have 
demonstrated very good performance [BrCJ89]. These benchmarks have shown the barrier execution 
mode to consistently outperform both SIMD and MIMD modes.
The detailed hardware design and performance analysis for hardware barrier synchronization is discussed
in a companion paper [OKDi90]. We briefly outline it here. The SBM barrier hardware has a very
simple structure closely resembling the enable/disable mask logic of a SIMD control unit. Each barrier is
represented by a bit mask indicating which processors participate in that barrier; these bit masks are
enqueued into a FIFO queue in the sequence in which they will be executed, as shown in figure 11. Processors
activate a WAIT signal when they execute a wait instruction. When every processor named in the
top barrier mask is waiting, i.e., the top mask is a subset of the set of waiting processors, the top barrier
executes and is removed from the queue. Processors participating in the executed barrier then proceed past
the wait instruction. If a processor not participating in the top barrier executes a wait, the wait instruction
will not complete until a barrier in which that processor participates becomes top and fires. When a
barrier executes, all participating processors resume execution simultaneously (on the next clock tick).
Proc. 0   Proc. 1   Proc. 2   Proc. 3
   1         1         0         0      <- Top
   0         0         1         1
   0         1         1         0
   1         1         0         0
   0         1         1         1
Figure 11
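The queue behavior described above can be modeled in a few lines. This is a software sketch of the idea only, not the hardware design of [OKDi90]; the class name `SBMBarrierQueue` and its methods are hypothetical, and a real implementation would release processors on a clock edge rather than inside a method call.

```python
from collections import deque

class SBMBarrierQueue:
    """FIFO of barrier masks; each mask is the set of participating processors."""

    def __init__(self, masks):
        self.queue = deque(frozenset(m) for m in masks)
        self.waiting = set()  # processors whose WAIT signal is active

    def wait(self, proc):
        """Processor `proc` executes a wait; return the processors released."""
        self.waiting.add(proc)
        released = set()
        # The top barrier fires once every participant in its mask is waiting.
        while self.queue and self.queue[0] <= self.waiting:
            mask = self.queue.popleft()
            self.waiting -= mask
            released |= mask
        return released

# First three masks of figure 11: {P0, P1}, {P2, P3}, {P1, P2}.
q = SBMBarrierQueue([{0, 1}, {2, 3}, {1, 2}])
print(q.wait(2))  # P2 is not in the top mask, so it keeps waiting
print(q.wait(3))  # P3 likewise; barrier {2, 3} is not yet on top
print(q.wait(0))  # P0 waits for P1
print(q.wait(1))  # top barrier fires, then the {2, 3} barrier fires as well
```

The second and third calls show the static ordering at work: processors 2 and 3 are both waiting, but their barrier is delayed until the barrier ahead of it in the queue has executed.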
4. Code Scheduling and Barrier Insertion Algorithms
The code scheduling algorithm will be described for barrier MIMDs in general. The appropriate restrictions
for the static barrier MIMD (SBM) are given at the end of this section. Although minimum and
maximum times are known for each of the instructions instead of a fixed execution time, it is straightforward
to adapt the scheduling heuristics commonly used for fixed-execution-time tasks, and this has
been our approach. It is well known that even for simple or relaxed cases the optimal static scheduling of
a partially ordered set of tasks on parallel processors is NP-hard, and hence computationally intractable
[KaNa84]. However, several heuristics with bounded worst-case performance degradation (from optimal)
have been found to be effective for this problem [Hu82]. In particular, the critical path method exhibits
good performance at reasonable computational cost.
Section 4.1 outlines a technique for labeling operations and section 4.2 applies these labels to generate
an ordering for list scheduling. Using the list ordering, section 4.3 details the assignment of operations
to specific processors. Upon assigning each operation to a processor, it may be necessary to insert a
barrier; algorithms for this purpose are given in section 4.4.
The scheduling algorithm proceeds in two phases: ordering of the nodes based on height information,
followed by the node assignment to processors, with barrier synchronizations inserted as necessary
during node assignment.
4.1. Node Labeling
The scheduling algorithm assumes that the instructions are represented in an instruction DAG³
(Directed Acyclic Graph) G(N, A), where N is the set of n instruction nodes and A is the set of edges
representing the precedence (producer/consumer) constraints between instructions. If an edge is directed
from node i to node j, node j is said to be a successor (or consumer) of node i. Similarly, node i is said to
be a predecessor (or producer) of node j.
3. The directed acyclic graph for instructions will be specified in upper-case (DAG)
whereas the directed acyclic graph for barriers will be given in lower-case (dag).
The DAG is assumed to have one entry and one exit node; dummy nodes (with zero execution time)
are added if necessary. Let $t(i)$ represent the execution time for node (instruction) $i$, which is assumed to
take integral values. For variable-execution-time instructions, $t_{\min}(i)$ and $t_{\max}(i)$ represent the minimum
and maximum execution times, respectively, for node $i$. For the instruction DAG, the critical path is
defined as the longest path from the entry node to the exit node, expressed mathematically as
$$t_{cr} = \max_{k} \sum_{i \in \phi_k} t(i)$$
where $\phi_k$ represents the $k$th path from the entry node to the exit node. Clearly, $t_{cr}$ represents a lower
bound on the execution time of the instruction DAG, regardless of the number of processors that execute
it. The height of node $i$ is defined as the length of the longest path from the exit node to node $i$ where the
orientation, or direction, of the edges is reversed, i.e.,
$$h(i) = \max_{k} \sum_{j \in \pi_k} t(j)$$
where $\pi_k$ represents the $k$th path from the exit node to the node $i$.
For the variable-execution-time instructions in the DAG, the minimum and maximum height for a
node $i$, $h_{\min}(i)$ and $h_{\max}(i)$, are defined as follows:
$$h_{\min}(i) = \max_{k} \sum_{j \in \pi_k} t_{\min}(j)$$
and
$$h_{\max}(i) = \max_{k} \sum_{j \in \pi_k} t_{\max}(j).$$
The minimum height corresponds to the height for node $i$ assuming all nodes in the DAG take their
minimum execution time, and similarly for the maximum height.
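The minimum and maximum heights can be computed with a memoized traversal from each node toward the exit node. The sketch below is illustrative only: it assumes that a path's length includes the execution times of both of its end nodes, and the function name, graph encoding, and example DAG are hypothetical.

```python
def heights(succs, t_min, t_max):
    """Return {node: (h_min, h_max)} for an instruction DAG.

    succs -- node -> list of successor (consumer) nodes
    t_min -- node -> minimum execution time
    t_max -- node -> maximum execution time
    """
    memo = {}

    def h(i):
        if i not in memo:
            # Longest remaining path below node i, for both time bounds.
            below_min = max((h(j)[0] for j in succs[i]), default=0)
            below_max = max((h(j)[1] for j in succs[i]), default=0)
            memo[i] = (t_min[i] + below_min, t_max[i] + below_max)
        return memo[i]

    for i in succs:
        h(i)
    return memo

# Hypothetical DAG: entry -> a (Load) -> c (Add) -> exit, entry -> b (Mul) -> exit.
succs = {"entry": ["a", "b"], "a": ["c"], "b": ["exit"], "c": ["exit"], "exit": []}
t_min = {"entry": 0, "a": 1, "b": 16, "c": 1, "exit": 0}
t_max = {"entry": 0, "a": 4, "b": 24, "c": 1, "exit": 0}
print(heights(succs, t_min, t_max))
```

The height pair of the entry node, here (16, 24), is the critical path length under minimum and maximum execution times respectively.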
4.2. Node Ordering
The maximum height and minimum height are computed for all nodes. This can be done in $O(n^2)$
time, since the problem reduces to finding longest paths from the exit node to all other nodes [Hu82]
(with edge orientations reversed). The nodes are first sorted into a list in descending order using the maximum
height as the key, followed by another sort (on nodes with equal maximum height) in descending
order using the minimum height as the key.




Figure 12: DAG Height Examples.
The maximum height is employed as a key first in an attempt to minimize the worst-case execution
time, i.e., when all instructions take maximum time. The minimum height is used to break ties as it
represents, in some sense, an attempt to optimize for the best case. For example, in figure 12, for the DAG
on the left, $h_{\max}(b) > h_{\max}(a)$, so node b is placed ahead of node a in the list. In the DAG on the right
$h_{\max}(d) = h_{\max}(e)$, and $h_{\min}$ must break the tie. Since $h_{\min}(e) > h_{\min}(d)$, node e is given priority in the
list.
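The two-key ordering amounts to a single sort on the pair (maximum height, minimum height) in descending order. The sketch below assumes heights are stored as (h_min, h_max) pairs; the function name and the height values are hypothetical and are not read from figure 12.

```python
def list_order(heights):
    """Sort nodes by maximum height, breaking ties by minimum height,
    both in descending order. `heights` maps node -> (h_min, h_max)."""
    return sorted(heights,
                  key=lambda n: (heights[n][1], heights[n][0]),
                  reverse=True)

# Hypothetical heights: d and e tie on h_max, and e has the larger h_min.
example = {"a": (2, 4), "d": (2, 5), "e": (4, 5)}
print(list_order(example))
```

With these values node e precedes node d because of the tie-break, and both precede node a.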
4.3. Node Assignment
During this phase, the nodes are removed (in order) from the sorted list and assigned to particular 
processors. Some nodes should be placed in a processor that includes a predecessor (producer) for that 
node. This serialization of the nodes increases efficiency because it reduces the number of processors 
required and may eliminate a run-time synchronization operation. On the other hand, too much serialization
can increase the schedule length. The node assignment algorithm attempts to strike a balance
between these two competing aims.
The current node being scheduled is referred to as node i. The processor to which some node j  is 
assigned is denoted as Processor(j).
[1] The first step in node assignment is to determine the set of processors in which the predecessors of
node i, denoted as Preds(i), are scheduled. These are referred to as the producer processors for i, or
ProdProc(i), and are computed as follows:
    proc ProdProc(i)
        for each node j in Preds(i)
            add Processor(j) to the set of producer processors
For each processor in ProdProc(i), determine if no other nodes are scheduled after the predecessor
of i on that processor. If a single producer processor meets this condition, place node i in that
processor, and insert a barrier if necessary, as described later in this section. If more than one producer
processor meets this condition, assign node i to the producer processor with the largest current maximum
time (to possibly avoid inserting a barrier). If all processors in ProdProc(i) have the same current
maximum time, choose one at random and assign i to it.
[2] If none of the processors satisfy the condition in Step 1, assign node i to a processor such that it is
scheduled as early as possible. In case of ties between processors, choose one at random. This helps
balance the number of nodes assigned to each processor. Insert a barrier as necessary.
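The processor choice in the two steps above can be sketched as follows. Several details are assumptions rather than part of the text: "scheduled as early as possible" is approximated by the smallest current maximum time, barrier insertion is left out, and the function name and all data structures are hypothetical.

```python
import random

def choose_processor(i, preds, processor_of, last_node, cur_max, rng=random):
    """Pick a processor for node i.

    preds        -- node -> set of predecessor (producer) nodes
    processor_of -- already scheduled node -> processor index
    last_node    -- per processor, the last node scheduled on it (or None)
    cur_max      -- per processor, its current maximum time
    """
    prod_procs = {processor_of[j] for j in preds[i]}
    # Step 1: producer processors on which a predecessor of i is still last.
    open_procs = [p for p in prod_procs if last_node[p] in preds[i]]
    if open_procs:
        best = max(cur_max[p] for p in open_procs)
        return rng.choice(sorted(p for p in open_procs if cur_max[p] == best))
    # Step 2 (assumption): take the processor that is free earliest.
    best = min(cur_max)
    return rng.choice([p for p, t in enumerate(cur_max) if t == best])

preds = {"c": {"a", "b"}}
processor_of = {"a": 0, "b": 1}
# Both producers are last on their processors; processor 1 has the larger time.
print(choose_processor("c", preds, processor_of, ["a", "b", None], [4, 24, 0]))
# Other nodes follow both producers, so step 2 picks the idle processor 2.
print(choose_processor("c", preds, processor_of, ["x", "y", None], [5, 25, 0]))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible, which is convenient when comparing schedules.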
4.4. Barrier Insertion
Two algorithms for barrier insertion are described. The first algorithm is conservative in that it
always adds a barrier synchronization when one is necessary, but it may add unnecessary, redundant barriers.
The other barrier insertion algorithm is “optimal” in the sense that a barrier is not inserted unless it
is absolutely necessary. The barrier dag $(B, <_b)$ is constructed incrementally as the nodes are assigned to
processors and barriers inserted into the schedule. It embodies the precedence constraints among the barriers,
as described in section 3.1.
The notion of one barrier “dominating” another is useful in constructing the barrier dag [AhSU86].
A barrier x dominates barrier y, written x dom y, if every path from the initial node of the barrier dag to y
goes through x.













Figure 13: Barrier Embedding where Conservative Algorithm Fails.
Each edge (u,v) between barriers u and v in a barrier dag contains the minimum and maximum execution
time for the code between the barriers. Note that the minimum execution time for edge (u,v) is
actually the maximum of the minimum times for all code regions between u and v. For example, in figure
13, the minimum execution time of the code between barriers x and y is five, not four; recall that no processor
proceeds past the barrier until all have arrived there. This constraint means that even if PE1
(processing element 1) executes the code between barriers x and y in 4 time units, the barrier would still
need to wait for PE0, which requires 5 time units. The maximum time for the edge (x,y) is 7 units.
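The edge labeling rule can be sketched as a pair of maxima over the code regions between two barriers. The minimum-time rule is stated in the text; taking the maximum of the maximum times is an assumption that follows from the same barrier semantics. The function name is hypothetical, and the per-processor maxima of 7 and 6 are illustrative values, since the text gives only the edge maximum of 7.

```python
def edge_time_range(regions):
    """Return (min, max) execution time for a barrier dag edge.

    regions -- one (min, max) execution time per processor for the code
               that lies between the two barriers
    """
    # The second barrier cannot fire before the slowest processor's best case.
    edge_min = max(lo for lo, _ in regions)
    # Assumed by the same reasoning: it fires by the slowest processor's worst case.
    edge_max = max(hi for _, hi in regions)
    return edge_min, edge_max

# PE0 needs at least 5 time units and PE1 at least 4, as in figure 13.
print(edge_time_range([(5, 7), (4, 6)]))
```

The result has a minimum of five rather than four, matching the discussion of figure 13.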
After consumer node i has been assigned to a processor C, it is necessary to check all producers for i 
to determine if a barrier is necessary. Suppose that node g is a member of the set Preds(i), the predeces­
sors, or producers, for instruction i, and that it is assigned to processor P.
4.4.1. Conservative Barrier Insertion
To determine if a barrier is needed between instruction nodes i and g, assigned to processors C and 
P, respectively, the following steps are performed:
[1] Define LastBar(g) as the last barrier to execute before node g. Define NextBar(g) to be the next barrier
to execute after node g. This step determines if a barrier or chain of barriers already exists
between nodes NextBar(g) and LastBar(i). This would guarantee that node g executes before node
i. This step consists of checking for a path between NextBar(g) and LastBar(i). Let us call this procedure
PathFind(g,i)⁴. If a path is found, no barrier is needed; otherwise, continue with step [2].
[2] In this step, the static timing constraints inherent in the barrier dag are examined to check if these
constraints resolve the synchronization statically. First, the nearest common dominating barrier for
barriers LastBar(g) and LastBar(i), written as CommonDom(g,i), is found. The dominator information
can be stored in a dominator tree. The initial node is the root of the dominator tree, and a node
dominates only its descendants in the tree. Hence, the common dominator CommonDom(g,i) is the
nearest common ancestor in the dominator tree. It is the last common synchronization point for processors
P and C.
[3] From this common dominator barrier, timing information is propagated up to the producer and consumer instructions g and i. The length of the longest path, assuming maximum execution times for code regions between barriers, from CommonDom(g,i) to LastBar(g) is computed. Denote this longest path as $\psi_{\max}(\mathrm{CommonDom}(g,i), \mathrm{LastBar}(g))$, and its length as $l(\psi_{\max}(\mathrm{CommonDom}(g,i), \mathrm{LastBar}(g)))$. Add the maximum time necessary to execute all instructions after LastBar(g) up to and including g, denoted as $\delta_{\max}(g)$, to this length to yield the maximum time $T_{\max}(g)$ to execute node g relative to the common dominating barrier.
[4] Similarly, the longest path, assuming minimum execution times for code regions, from the common dominator to LastBar(i) is computed. Denote this longest path as $\psi_{\min}(\mathrm{CommonDom}(g,i), \mathrm{LastBar}(i))$, and its length as $l(\psi_{\min}(\mathrm{CommonDom}(g,i), \mathrm{LastBar}(i)))$. Let $i^-$ represent the instruction before i on processor C. The minimum time to execute all instructions up to but not including i, denoted as $\delta_{\min}(i^-)$, is added to this length, yielding the minimum time $T_{\min}(i^-)$ to start executing node i relative to the common dominator.
[5] If $T_{\min}(i^-) > T_{\max}(g)$, then no barrier is needed; otherwise, go to step [6].
[6] A barrier is inserted across processor P somewhere after the producer node g, and across processor C just before node i. To determine where the barrier is placed on P, we compute $l(\psi_{\max}(\mathrm{CommonDom}(g,i), \mathrm{LastBar}(i)))$, and then add in the maximum execution times of all instructions on C after LastBar(i) up to node $i^-$, yielding $T_{\max}(i^-)$. If $T_{\max}(i^-) \le T_{\max}(g)$, then the barrier is inserted right after the producer node g on processor P. However, if $T_{\max}(i^-) > T_{\max}(g)$, and if there is some instruction $g^+$ after instruction g, such that $T_{\max}(i^-)$ falls into the execution time range (assuming maximum times) of $g^+$, then the barrier is inserted after $g^+$ on processor P. This approach allows the producer processor to execute a bit more work after the producer instruction g.
This barrier insertion algorithm will sometimes add unnecessary barriers. For example, in the barrier embedding given in figure 13, the conservative insertion algorithm will insert a barrier across processors 0 and 2, after the producer node g and before consumer node i. In the figure, LastBar(g) is y and LastBar(i) is z. The common dominating barrier for y and z is x. It can be seen that $T_{\max}(g) = 9$, while $T_{\min}(i^-) = 8$, so it would appear that a barrier is necessary.
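Steps [1] through [5] of the conservative test can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the barrier dag is a dictionary mapping an edge to its (min, max) time, the dominator tree is given as an immediate-dominator map, and all function names are hypothetical. The example dag uses the figure 13 values quoted in the text, plus an assumed direct edge (x,z) with invented times (3,4) so that x, rather than y, dominates z; the value 2 for the producer's maximum time after y is inferred from 9 = 7 + 2.

```python
def successors(edges):
    """Build an adjacency list from the edge dictionary."""
    succ = {}
    for (a, b) in edges:
        succ.setdefault(a, []).append(b)
    return succ


def path_exists(edges, src, dst):
    """Step [1]: is there a chain of barriers from src to dst?"""
    succ = successors(edges)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, []))
    return False


def common_dom(idom, a, b):
    """Step [2]: nearest common ancestor in the dominator tree."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = idom.get(a)
    while b not in ancestors:
        b = idom[b]
    return b


def longest_path(edges, src, dst, index):
    """Longest src->dst length; index 0 uses min times, 1 uses max times."""
    succ = successors(edges)

    def walk(node):
        if node == dst:
            return 0
        best = None
        for nxt in succ.get(node, []):
            rest = walk(nxt)
            if rest is not None:
                cand = edges[(node, nxt)][index] + rest
                if best is None or cand > best:
                    best = cand
        return best

    return walk(src)


def needs_barrier(edges, idom, last_bar_g, next_bar_g, last_bar_i,
                  delta_max_g, delta_min_i_prev):
    """Conservative test: True if a barrier must be inserted."""
    if next_bar_g is not None and path_exists(edges, next_bar_g, last_bar_i):
        return False                                  # step [1]
    dom = common_dom(idom, last_bar_g, last_bar_i)    # step [2]
    t_max_g = longest_path(edges, dom, last_bar_g, 1) + delta_max_g      # [3]
    t_min_i = longest_path(edges, dom, last_bar_i, 0) + delta_min_i_prev  # [4]
    return not (t_min_i > t_max_g)                    # step [5]


# Edge (x,z) and its times are assumptions made for this sketch.
edges = {("x", "y"): (5, 7), ("y", "z"): (2, 2), ("x", "z"): (3, 4)}
idom = {"x": None, "y": "x", "z": "x"}
result = needs_barrier(edges, idom, last_bar_g="y", next_bar_g=None,
                       last_bar_i="z", delta_max_g=2, delta_min_i_prev=1)
print(result)  # True: 8 is not greater than 9
```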
However, the longest path from x to z, $\psi_{\min}(x,z)$, overlaps with the longest path from x to y, $\psi_{\max}(x,y)$, on edge (x,y); recall that different assumptions have been made about the execution time for this edge on the different paths. If this is taken into account, then $\psi_{\min}(x,z)$ should be computed, as before, assuming minimum execution times for edges except for the edges which intersect $\psi_{\max}(x,y)$. For these edges, use their maximum execution time when computing the longest path. In figure 13, this means that edge (x,y) has value 7, (y,z) has value 2, and the minimum time for $i^-$ is 1, yielding an actual minimum time for node i of 10. Thus, i always executes after g and no barrier is required. In the next section, an optimal barrier insertion algorithm that does not generate these unnecessary barriers is briefly discussed.
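The overlap adjustment in this example can be checked with a short sketch. The helper names are hypothetical, paths are enumerated by brute force (adequate only for small dags), and the direct edge (x,z) with times (3,4) is an assumption added so that x is the common dominator; the remaining numbers are the ones quoted above.

```python
def enumerate_paths(edges, src, dst):
    """Yield every src->dst path in the dag as a list of edges."""
    if src == dst:
        yield []
        return
    for (a, b) in edges:
        if a == src:
            for rest in enumerate_paths(edges, b, dst):
                yield [(a, b)] + rest


def adjusted_min_length(edges, src, dst, producer_path):
    """Longest src->dst length using minimum edge times, except that
    edges shared with the producer's path use their maximum times."""
    overlap = set(producer_path)

    def weight(edge):
        lo, hi = edges[edge]
        return hi if edge in overlap else lo

    return max(sum(weight(e) for e in path)
               for path in enumerate_paths(edges, src, dst))


# (x,z) is an assumed edge; (x,y) and (y,z) carry the quoted values.
edges = {("x", "y"): (5, 7), ("y", "z"): (2, 2), ("x", "z"): (3, 4)}
producer_path = [("x", "y")]        # longest max-time path from x to y
t_max_g = 7 + 2                     # quoted value of 9 for the producer
t_min_i = adjusted_min_length(edges, "x", "z", producer_path) + 1
print(t_min_i, t_min_i > t_max_g)   # 10 True, so no barrier is required
```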
4.4.2. Optimal Barrier Insertion
From the previous example, it is clear that the problem with the conservative insertion algorithm is 
that it does not take into account the possibility that the longest paths from the common dominator to the 
producer and consumer nodes may overlap. In such cases, assuming maximum execution times on edges 
that overlap may increase the minimum execution time for the consumer node just enough to resolve the 
synchronization statically.
But resolving the synchronization is not quite that simple. The “second” longest path (to the producer node) must also be checked: if the execution time on this path is also greater than the consumer 
node maximum execution time, then the same check for path overlap and resulting timing adjustments 
must be made. This process continues for the decreasing longest paths to the producer node until the 




Let u be the nearest common dominator for barriers v and w, where v is LastBar(g) and w is 
LastBar(i). Recall that node i is being scheduled, node g is a producer for node i, and it is necessary to 
determine if a barrier is required between these instructions.
The relationship between the various path lengths can be expressed as
$$l(\psi^{1}_{\max}(u,v)) \ge l(\psi^{2}_{\max}(u,v)) \ge l(\psi^{3}_{\max}(u,v)) \ge \cdots \ge l(\psi^{k-1}_{\max}(u,v)) \ge l(\psi_{\min}(u,w)) + (\delta_{\min}(i^-) - \delta_{\max}(g)) > l(\psi^{k}_{\max}(u,v)),$$
where $\psi^{j}_{\max}(u,v)$, $2 \le j \le k$, represents the jth longest path (assuming maximum execution times) from u to v, and $\psi^{1}_{\max}(u,v)$ is the longest such path. For each $\psi^{j}_{\max}(u,v)$, find $\psi^{j}_{\min}(u,w)$, the longest path from u to w assuming minimum execution times except for edges on the path $\psi^{j}_{\max}(u,v)$, where maximum execution time edges are assumed. If the condition
$$l(\psi^{j}_{\max}(u,v)) + \delta_{\max}(g) < l(\psi^{j}_{\min}(u,w)) + \delta_{\min}(i^-)$$
is satisfied, consider the next longest path $\psi^{j+1}_{\max}(u,v)$ and repeat the process. If the condition is not met, then a barrier must be inserted as described in step [6] of the conservative barrier insertion algorithm, and the scheduling algorithm starts again with the next node in the list.
The process of checking the jth longest path continues until
$$l(\psi^{j}_{\max}(u,v)) + \delta_{\max}(g) < l(\psi^{j}_{\min}(u,w)) + \delta_{\min}(i^-)$$
is met for the jth longest path, where j = k, proving that the synchronization is satisfied statically and no barrier need be inserted.
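The iteration over successively shorter producer paths might be sketched as below. This is an illustration under assumptions, not the authors' implementation: paths are enumerated by brute force, all names are hypothetical, and the loop stops at the first path whose maximum-time bound already falls below the unadjusted consumer bound, which is how the chain of inequalities above defines k. The example dag reuses the figure 13 values plus an assumed edge (x,z) with invented times (3,4).

```python
def enumerate_paths(edges, src, dst):
    """Yield every src->dst path in the dag as a list of edges."""
    if src == dst:
        yield []
        return
    for (a, b) in edges:
        if a == src:
            for rest in enumerate_paths(edges, b, dst):
                yield [(a, b)] + rest


def max_length(edges, path):
    """Length of a path assuming maximum edge times."""
    return sum(edges[e][1] for e in path)


def longest_min(edges, src, dst, overlap=()):
    """Longest src->dst length with minimum times, except that edges in
    `overlap` use their maximum times."""
    shared = set(overlap)
    return max(
        sum(edges[e][1] if e in shared else edges[e][0] for e in path)
        for path in enumerate_paths(edges, src, dst)
    )


def optimal_needs_barrier(edges, u, v, w, delta_max_g, delta_min_i_prev):
    """True if a barrier is required between producer g and consumer i.

    u is the common dominator, v is LastBar(g), w is LastBar(i).
    """
    producer_paths = sorted(enumerate_paths(edges, u, v),
                            key=lambda p: max_length(edges, p), reverse=True)
    plain_bound = longest_min(edges, u, w) + delta_min_i_prev
    for path in producer_paths:
        t_max_g = max_length(edges, path) + delta_max_g
        if t_max_g < plain_bound:
            return False          # kth path reached: all others are shorter
        adjusted = longest_min(edges, u, w, overlap=path) + delta_min_i_prev
        if not (t_max_g < adjusted):
            return True           # this path cannot be resolved statically
    return False


edges = {("x", "y"): (5, 7), ("y", "z"): (2, 2), ("x", "z"): (3, 4)}
print(optimal_needs_barrier(edges, "x", "y", "z", 2, 1))  # False
print(optimal_needs_barrier(edges, "x", "y", "z", 3, 1))  # True
```

With the quoted figure 13 values the conservative test would insert a barrier, while this check does not; raising the producer's time after y from 2 to 3 (a made-up variation) makes the barrier necessary again.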
4.4.3. Barrier Merging
An additional merging step is performed when inserting barriers into an SBM schedule. If the exe­
cution time range of the new barrier overlaps with any other barriers currently scheduled, and if the over­
lapping barriers are not ordered with respect to the barrier dag, then they are merged into a single barrier. 
For one set of benchmarks studied (with 10 variables and 80 statements) the merging of barriers resulted 
in 35% fewer barriers in the resulting schedules. The static scheduling fraction also increased as a result 
of the larger barriers in the SBM schedule. The merging of barriers increased the completion time for the 
SBM compared to the DBM, although these times are quite close for the benchmarks studied. More 
scheduling results are given in the next section.
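The merging rule can be illustrated with a small sketch. Everything here is hypothetical: the `Barrier` record, the `ordered` predicate standing in for a barrier dag query, and in particular the choice of time range for a merged barrier (the later of the two earliest and latest times, on the assumption that a merged barrier fires only when all participants have arrived). The sketch makes a single pass and does not re-check barriers against the widened range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    name: str
    processors: frozenset
    t_min: int   # earliest time the barrier can fire
    t_max: int   # latest time the barrier can fire


def ranges_overlap(a, b):
    """True if the execution time ranges of two barriers intersect."""
    return a.t_min <= b.t_max and b.t_min <= a.t_max


def merge_new_barrier(scheduled, new, ordered):
    """Insert `new`, merging it with overlapping, unordered barriers.

    ordered(a, b) should return True when the barrier dag orders a and b.
    """
    merged, kept = new, []
    for old in scheduled:
        if ranges_overlap(old, merged) and not ordered(old, new):
            merged = Barrier(
                name=old.name + "+" + merged.name,
                processors=old.processors | merged.processors,
                t_min=max(old.t_min, merged.t_min),   # assumed rule
                t_max=max(old.t_max, merged.t_max),   # assumed rule
            )
        else:
            kept.append(old)
    return kept + [merged]


scheduled = [Barrier("b0", frozenset({0, 1}), 3, 5),
             Barrier("b1", frozenset({2, 3}), 10, 12)]
new = Barrier("b2", frozenset({4, 5}), 4, 6)
result = merge_new_barrier(scheduled, new, ordered=lambda a, b: False)
print([b.name for b in result])  # ['b1', 'b0+b2']
```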
5. Scheduling Experiments
The scheduling algorithms discussed in the last section were applied to the synthetic benchmark programs. (Although an optimal algorithm was presented for barrier insertion, the conservative algorithm was used for all the scheduling experiments. This was done because the conservative algorithm is much simpler and the results were very good.) The effects of varying different parameters that are related to the architecture of the machine and the structure of the synthetic benchmarks have been studied. Architecture parameters that were varied include the number of processors and timing assigned to each instruction; barriers were assumed to always execute immediately upon arrival of the last participating processor. Benchmark parameters included the number of instructions and variables in generated programs. Particular attention has been paid to the different synchronization fractions and how they vary as the parameters change. These results 













Figure 14: Scatter Plot (Benchmarks Contain From 65 to 132 Syncs.). Horizontal axis: Fraction Statically Scheduled (0 to 1).
One-hundred synthetic benchmarks were generated for each set of parameters and the results aver­
aged to yield each point on the curves shown in this section. Over 3500 benchmarks were generated. 
Over all these programs, the results fell into the following ranges:
• The barrier fraction varies from 3% to 23%.
• The serialization fraction varies from 50% to 90%.
• The fraction of barriers statically scheduled away varies from 8% to 40%.
Note that the last fraction represents a feature unique to barrier architectures. A scatter plot with the seri­
alization fraction on the vertical axis and the statically scheduled fraction on the horizontal axis is shown 
in figure 14. The results for more than 2000 of the synthetic benchmarks are given in the figure, and it 
can be seen that the “center of mass” of the points lies near the 85% line; hence, about 85% of the synchronizations are either serialized or statically scheduled away.
5.1. Number of Assignment Statements
In this section, the number of processors and variables are fixed (8 processors and 15 variables), 
while the number of instructions varies. This section investigates the effect of changing the number of 






Figure 15: Sync. Frac. for 8 Processors (PEs) and 15 Variables (Vars.)
As can be seen in figure 15, the fraction of barrier synchronization decreases as the number of state­
ments in the generated basic blocks increases. This decrease is relatively large when the number of state­
ments varies from 5 to 20. Generally speaking, the Load operations are scheduled for early execution. 
Since Load is a variable-execution-time instruction (from 1 to 4 time units), barriers are often required 
after a Load. Hence, there is a concentration of barriers at the beginning of the execution of the basic 
block. The barrier fraction decreases when we increase the number of statements because the percentage 
of Loads becomes smaller. This effect is counter-balanced at larger benchmark sizes because Mul, 
Div, and Mod instructions begin appearing in generated benchmarks, and these instructions have large 
execution time variation.
Notice that the serialization fraction decreases as the benchmark size increases. Larger benchmarks 
make it less likely that a consumer can be placed directly after a producer because another instruction has 
probably been scheduled there already.
5.2. Number of Variables
In this section, the benchmark size and number of processors are fixed (60 statements and 8 processors), while the number of variables is changed. Referring to figure 16, as the number of variables increases, 
the barrier fraction first increases, then remains constant. Since the number of variables corresponds 
closely to the parallelism width of the benchmarks, the parallelism width increases with the number of 
variables, and the scheduling algorithm employs more processors in the schedule. The result is that more 
barriers are generated until the parallelism width of the benchmark exceeds the number of processors, and 





Figure 16: Sync Fractions for 8 PEs and 60 Instructions (Instrs.)
The serialization fraction decreases when more variables are used because for a large number of 
variables, the parallelism width is large, and fewer opportunities for serialization exist.
5.3. Number of Processors
In this section, the benchmark size and number of variables are fixed, while the number of proces­
sors is varied. For two variables, increasing the number of processors did not affect the barrier fraction 
because the scheduling algorithm keeps almost all of the instructions in two processors. The serialization 
















Figure 17: Sync Fractions for 100 Instrs. and 10 Vars. Horizontal axis: Number of Processors (4 to 16).
For five variables, when the number of processors is increased from two to four the barrier fraction 
increases because more synchronization is required between the different processors. The barrier fraction 
stabilizes after four processors as the scheduling algorithm employs no more than four processors. For N 
variables, the barrier fraction increases when the number of processors employed is less than the parallelism 
width. When the number of processors used is greater than the parallelism width, then the barrier fraction 
remains constant.
Figure 17 illustrates the effect of increasing the number of processors on the different synchronization fractions for a barrier machine; this figure is for basic blocks of 100 assignment statements with 10 variables.
The serialization fraction remained nearly constant as the number of processors increased. There are 
two effects contributing to this serialization rate behavior that tend to cancel each other out, resulting in a 
fixed serialization fraction. The first effect is that for small numbers of processors, the scheduling algo­
rithm often inadvertently serializes a synchronization operation. Conversely, the scheduling algorithm 
has fewer opportunities to serialize for a barrier machine with a small number of processors, because 
quite often the “ serialization slot”  will already be filled. The first effect results in an increase in the seri­




5.4. Analysis of the Heuristics
The heuristics employed in code scheduling were analyzed by first modifying them in some way 
and then scheduling the same synthetic benchmarks on the original and new version of the heuristic. Vari­
ous characteristics of the resulting schedules, including execution time and synchronization fractions, 
were compared to gain insights into the performance of each heuristic.
The node assignment step was modified to perform simple round-robin scheduling, where the ith 
node in the sorted list is scheduled on processor (i modulo N), where N is the number of processors. As 
expected, the serialization fraction nearly vanished for large numbers of processors; the barrier fraction 
also increased significantly, in some cases reaching 50%. Both the minimum and maximum execution 
times increased, although the execution time difference between list scheduling and pure round-robin 
became smaller for larger numbers of processors. These results suggest that the node assignment heuris­
tics employed in the scheduling algorithm were reasonable.
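The round-robin variant is simple enough to state directly. A minimal sketch, with a hypothetical function name and node labels:

```python
def round_robin_assign(sorted_nodes, num_processors):
    """Assign the ith node of the sorted list to processor i modulo N."""
    schedule = [[] for _ in range(num_processors)]
    for i, node in enumerate(sorted_nodes):
        schedule[i % num_processors].append(node)
    return schedule


print(round_robin_assign(["a", "b", "c", "d", "e"], 2))
# [['a', 'c', 'e'], ['b', 'd']]
```

Because a consumer rarely lands on the same processor as its producer under this rule, few synchronizations are serialized, which is consistent with the behavior reported above.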
Another change to the node assignment heuristic switched the ordering priority: the minimum 
height $h_{\min}$ was employed first in the sort of the instruction nodes, followed by the maximum height to 
break ties. As expected, the minimum execution time of the benchmarks decreased while the maximum 
time increased. This is not surprising as the new ordering, in some sense, attempts to optimize the 
minimum execution time. However, the changes were quite small.
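The two orderings differ only in which height is the primary sort key. The sketch below assumes that each node's minimum and maximum heights are already known and that larger heights are scheduled first; both the function name and the sample heights are made up for illustration.

```python
def order_nodes(heights, min_first=False):
    """Return node names in priority order, highest priority first.

    heights maps a node name to (h_min, h_max). By default the maximum
    height is the primary key and the minimum height breaks ties;
    min_first=True swaps the two roles.
    """
    if min_first:
        key = lambda n: (heights[n][0], heights[n][1])
    else:
        key = lambda n: (heights[n][1], heights[n][0])
    return sorted(heights, key=key, reverse=True)


heights = {"a": (3, 9), "b": (5, 7), "c": (5, 9)}
print(order_nodes(heights))                  # ['c', 'a', 'b']
print(order_nodes(heights, min_first=True))  # ['c', 'b', 'a']
```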
An attempt was made to increase the serialization rate by performing lookahead on the sorted 
instruction list. Instructions in a window of size p on the top of the list were examined before a node was 
assigned to a processor to see if such an assignment would preclude a later serialization assignment. The 
effects were interesting: the serialization fraction increased as expected, but not very much for large 
numbers of processors, as the scheduling algorithm keeps the serial streams on a single path without lookahead. For small numbers of processors, the execution time increased by 10% to 30% due to an 
increase in the critical path length brought on by the additional serializations. This increase disappears 
for a large number of processors.
Another parameter of interest is the timing variation of the instructions. Synthetic programs were 
generated with instructions having very large timing variations. The results showed that the barrier sync 
fraction was not very sensitive to increases in instruction timing variation, increasing only slightly for 
large variations.
6. VLIW vs Barrier Architecture
A comparison between barrier MIMD and VLIW execution of synthetic benchmark programs was 
undertaken. The results are given in figure 18.
The VLIW execution mode used in scheduling the instructions assumed that all instructions 
required their maximum time to execute. No asynchrony was allowed in VLIW execution. As can be 
seen in the figure, the maximum times for both the barrier MIMD and VLIW were nearly identical. Exe­













Figure 18: VLIW vs Barrier Architecture (60 Instrs. and 10 Vars.). Curves: Maximum DBM Time, VLIW Time, Minimum DBM Time. Horizontal axis: Number of Processors (2 to 16).
took slightly longer to complete execution for smaller numbers of processors because more barriers were 
required due to instruction execution variation.
The minimum barrier MIMD completion time was about 25% lower than the VLIW completion 
time, and average barrier completion time will fall between the minimum and maximum times, the exact 
value being determined by the probability distributions of the variable-execution-time instructions. It 
should be noted that an optimal schedule (completion time equal to the critical path time) was determined 
for almost all the synthetic benchmarks for the comparison.
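When the schedule is optimal in the sense above, the comparison reduces to critical path lengths under different timing assumptions. The following sketch assumes a tiny, made-up dependence graph; only the Load range of 1 to 4 time units comes from the text, and the Mul and Add times are invented. Barrier overhead is ignored.

```python
def critical_path(preds, cost):
    """Length of the longest path through a dependence dag.

    preds maps each node to the list of its producers; cost maps each
    node to its execution time.
    """
    finish = {}

    def finish_time(node):
        if node not in finish:
            start = max((finish_time(p) for p in preds[node]), default=0)
            finish[node] = start + cost[node]
        return finish[node]

    return max(finish_time(n) for n in preds)


# Hypothetical basic block: two loads feed a multiply, which feeds an add.
preds = {"load_a": [], "load_b": [], "mul": ["load_a", "load_b"], "add": ["mul"]}
times = {"load_a": (1, 4), "load_b": (1, 4), "mul": (3, 6), "add": (1, 1)}

min_cost = {n: lo for n, (lo, hi) in times.items()}
max_cost = {n: hi for n, (lo, hi) in times.items()}

vliw_time = critical_path(preds, max_cost)     # VLIW assumes maximum times
barrier_min = critical_path(preds, min_cost)   # best case for barrier MIMD
barrier_max = critical_path(preds, max_cost)   # worst case for barrier MIMD
print(vliw_time, barrier_min, barrier_max)     # 11 5 11
```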
These results suggest that it is important to reduce the maximum execution time for instructions, 
and this is traditionally done in VLIWs by adding extra hardware for such operations as integer and 
floating-point multiplies. Also, memory reference delays are reduced by having multiple paths to 
memory banks and through careful scheduling of memory references to avoid bank conflicts [CNOPR88].
7. Conclusions
VLIW architectures have proven to be capable of providing consistently good performance over a 
larger range of programs than vector processors; the barrier MIMD architectures hold great promise for 
extending this range even further. Whereas VLIW architectures cannot achieve efficient parallel execution of while loops, subroutine calls, and variable-execution-time instructions, barrier MIMDs provide a hardware mechanism which allows VLIW-like static scheduling to be applied to all these constructs.
Barrier MIMD hardware has been shown to be easily and efficiently implementable [OKDi90]; 
hence, the prime question is whether a reasonable approximate solution to the NP-complete problem of 
static code scheduling for such machines can be found. In this paper, we have presented a set of algo­
rithms which efficiently implements this code scheduling. Further, we have shown the proposed algo­
rithms to perform very well, citing the results from applying these algorithms to over 3500 synthetic 
benchmark programs.
Ongoing work includes the extension of the basic scheduling techniques to more complex code 
structures (including arbitrary control flow) and the possible application of the barrier scheduling tech­
niques to remove some synchronizations in conventional MIMD architectures. We are also working 
toward a complete machine design, the CARP (Compiler-oriented Architecture Research at Purdue) 













A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, MA, 1986.
A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
W. G. Alexander and D. B. Wortman, “Static and Dynamic Characteristics of XPL Programs,” IEEE Computer, November 1975, pp. 41-46.
E. C. Bronson, T. L. Casavant, and L. H. Jamieson, “Experimental Application-Driven Architecture Analysis of a SIMD/MIMD Parallel Processing System,” Proc. 1989 Int. Conf. Parallel Processing, vol. I, pp. 59-67, Aug. 1989. Also to appear in IEEE Transactions on Parallel and Distributed Systems, Spring 1990.
C. D. Callahan II, A Global Approach to the Detection of Parallelism, Ph.D. dissertation, Rice University, March 1987.
R. P. Colwell, R. P. Nix, J. J. O’Donnell, D. B. Papworth, and P. K. Rodman, “A VLIW Architecture for a Trace Scheduling Compiler,” IEEE Trans. on Computers, vol. C-37, no. 8, pp. 967-979, Aug. 1988.
H. G. Dietz and T. Schwederski, “Extending Static Synchronization Beyond VLIW,” Technical Report TR-EE 88-25, Purdue University, School of Electrical Engineering, June 1988.
H. G. Dietz, T. Schwederski, M. T. O’Keefe, and A. Zaafrani, “Extending Static Synchronization Beyond VLIW,” Supercomputing 89, pp. 416-425, Reno, NV, Nov. 1989.
H. G. Dietz, H. J. Siegel, W. E. Cohen, M. T. O’Keefe, et al., “A Compiler-Oriented Architecture: The CARP Machine,” Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, December 1989.










T. C. Hu, Combinatorial Algorithms, Addison-Wesley, Reading, MA, 1982.
H. Kasahara and S. Narita, “Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing,” IEEE Trans. on Computers, vol. C-33, no. 11, pp. 1023-1029, November 1984.
M. T. O’Keefe, Barrier MIMD Architectures: Performance Analysis, Design, and Compilation, Ph.D. dissertation, in preparation, School of Electrical Engineering, Purdue University, Spring 1990.
M. T. O’Keefe and H. G. Dietz, “Hardware Barrier Synchronization,” submitted to 1990 International Conference on Parallel Processing, St. Charles, IL, August 1990.
T. Schwederski, W. G. Nation, H. J. Siegel, and D. G. Meyer, “The Implementation of the PASM Prototype Control Hierarchy,” Proceedings of the Second International Conference on Supercomputing, vol. 1, 1987, pp. 418-427.
P. L. Shaffer, “Minimization of Interprocessor Synchronization in Multiprocessors with Shared and Private Memory,” Proc. 1989 Int. Conf. Parallel Processing, vol. III, pp. 138-142, Aug. 1989, St. Charles, IL.
