Abstract. In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.
Introduction
Run-time synchronization overhead is a critical factor in achieving high speedup using parallel computers. A key advantage of SIMD (single instruction stream/multiple data stream) architectures is that synchronization is effected statically at compile time; hence, the execution-time cost of synchronization between "processes" is essentially zero. VLIW (very long instruction word) [Ellis 1985; Colwell 1988; Fisher 1984] machines are successful in large part because they preserve this property while providing more flexibility in terms of the operations that can be parallelized. Unfortunately, VLIWs cannot tolerate any asynchrony in their operation; hence, they are incapable of parallel execution of multiple flow paths, subprogram calls, and variable execution-time instructions. In a recent paper [Dietz, Schwederski et al. 1989] a new architecture was proposed that extended the static synchronization properties of the SIMD and VLIW class of parallel machines into the MIMD domain. The new architecture is called barrier MIMD.
In this paper we describe scheduling and barrier placement algorithms for barrier MIMDs as well as extensive results from scheduling synthetic benchmarks using the new algorithms. A barrier MIMD is a MIMD computer that has specialized hardware implementing a new type of barrier synchronization which allows the compiler to perform static scheduling. If a barrier is placed across a set of processes, then no process can execute past that barrier until all have reached the barrier. Unlike other barrier mechanisms, all processes will resume execution in exact synchrony. Hence, immediately after executing a barrier, the machine can be treated as a VLIW, using static scheduling to eliminate the need for further run-time synchronization.
However, VLIW machines allow neither MIMD code structures (e.g., multiple flow paths) nor variable-time instructions. In a barrier MIMD machine, static scheduling tracks both minimum and maximum completion times for each process's code; run-time synchronization is needed if and only if the minimum start time for the consumer of an object is less than the maximum completion time for that object's producer. If the timing constraints cannot be met statically, this implies that the static timing information has become too "fuzzy." Inserting another barrier effectively reduces this fuzziness to zero. Based on the static scheduling work described in this paper, more than 77% of all synchronizations that would occur in execution on a conventional MIMD will be accomplished without run-time synchronization in a barrier MIMD.
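The synchronization test stated above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the `Interval` type and the function name `needs_runtime_sync` are hypothetical names introduced here, and times are assumed to be measured from the most recent barrier.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Earliest and latest possible time, in time units since the last barrier."""
    lo: int
    hi: int


def needs_runtime_sync(producer_finish: Interval, consumer_start: Interval) -> bool:
    """True when the consumer could start before the producer is sure to be done."""
    return consumer_start.lo < producer_finish.hi


# Producer finishes somewhere in [16, 24]; the consumer cannot start before 25.
safe = needs_runtime_sync(Interval(16, 24), Interval(25, 30))
# The consumer may start as early as 20, inside the producer's window.
unsafe = needs_runtime_sync(Interval(16, 24), Interval(20, 30))
print(safe, unsafe)
```

Only the consumer's minimum start time and the producer's maximum completion time enter the comparison, which is why tightening either bound (for example, by a barrier) can remove the need for a run-time synchronization.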
When barrier MIMD architectures were originally proposed [Dietz and Schwederski 1988; Dietz, Schwederski et al. 1989], an algorithm for inserting barriers while scheduling code was given. The algorithm attempted to minimize the number of barriers required by using the static timing constraints inherent in the barrier synchronization operation. In these previous papers, no implementation of the barrier insertion algorithm was attempted, nor was there any algorithm proposed for the actual code scheduling [Dietz, Schwederski et al. 1989]. This work contains three new results: a code scheduling algorithm for barrier MIMDs, an "optimal" barrier insertion algorithm, and extensive scheduling experiments on synthetic benchmarks using the new algorithms.
The paper is organized as follows. Section 2 describes the structure of the synthetic benchmark programs, while Section 3 explains the principles of operation for a barrier MIMD. Section 4 gives the scheduling and barrier insertion algorithms and is followed by the description of the scheduling experiments in Section 5. Section 6 provides a comparison between VLIW and SBM performance for the synthetic benchmarks used in this paper. Finally, Section 7 gives conclusions and describes current research efforts.
Structure of the Synthetic Benchmark Programs
This study focuses on fine-grain scheduling of a single-chip multiprocessor RISC node [Dietz, Siegel et al. 1989] that employs the barrier mechanism discussed in this paper. Expensive operations such as multiplication and division are implemented as data-dependent code sequences that introduce asynchrony into the chip operation. Memory accesses across a shared bus or interconnection network are subject to contention, which introduces stochastic delays. It is shown that static scheduling may still be used to advantage within this framework.
In this work we wished to characterize and study the extent to which static scheduling can be employed in barrier MIMDs. In particular, measurements of the number of synchronizations that are satisfied statically, at compile time, versus the number that require explicit synchronization instructions executed at run time were desired. To this end a compiler was developed for a simple language.
The easiest way to measure performance would have been to examine the behavior of a few real programs, but the statistics for any small number of real programs can be highly misleading. By synthesizing a large number of benchmarks with carefully controlled properties, it is possible to quote more than just performance on a few cases--one can directly determine the relationship between some aspects of program structure and the resulting performance.
For this reason the programs to be scheduled on barrier machines were automatically generated using common instruction execution frequencies [Alexander and Wortman 1975] . This allowed us to automatically generate a very large number of synthetic benchmarks from which summary statistics were obtained. It also made it quite simple to change the various characteristics of generated programs to observe the effects on the statistics of scheduled programs.
The synthesized programs did not contain subroutine calls or loops. This is realistic in that it shows how the barrier solution compares with VLIW trace scheduling [Ellis 1985; Fisher 1984], which focuses on parallel execution of large basic blocks. Because these results do not take advantage of the ability of barrier MIMDs to execute arbitrary control flow code (e.g., while loops and calls) in parallel, they give a lower bound for the performance of barrier MIMD relative to VLIW. These results are also useful in projecting performance of barrier MIMD for relatively large-grain tasks (e.g., parallel subroutine calls) in that, with appropriate adjustment to the minimum and maximum execution times, each "instruction" in the synthetic benchmarks can represent a large-grain task.
Benchmark Instruction Set
The scheduling program takes as input a basic block of instructions. A basic block is a region of code that contains a sequence of consecutive statements. This region should have a single entry point and no embedded control structures [Aho et al. 1986]. The instruction set used for the synthetic code sequences contains nine instructions, four of which have variable execution time. The variable-time instructions are as follows:
Load. Execution time varies from one to four time units. In a shared-bus multiprocessor, this difference is mainly due to different access times between local cache and main memory. Typically, an access to the main memory is anywhere from four to twenty times longer than an access to cache. A more pronounced difference between local and nonlocal memory is often found in multiprocessors that require nonlocal accesses to go through single- or multi-stage interconnection networks.

Mul. Execution time varies from 16 to 24 units of time. This corresponds to a multiplier operation that is implemented either using shift and add instructions, or in an asynchronous hardware design. In either case, execution time is variable to take advantage of data-dependent optimizations. Synchronous designs with constant execution time are possible, but require more hardware, typically in the form of pipelines. Since multiplication is a commonly executed instruction, this additional hardware can sometimes be justified.

Div. Execution time varies from 24 to 32 time units, for much the same reason as multiply. Asynchronous designs are much more common for division because the operation is inherently harder to pipeline and the additional hardware is not justified by its typically low execution frequency.

Mod. The execution time of the modulus operation Mod varies between 24 and 32 time units, for the same reasons as division.
For the other operations it is realistic to assume that they have a constant execution time of one unit. These operations are Or, And, Add, Sub, and Store. Although interrupts could extend instruction execution times, it is reasonable either to stall the entire machine or to postpone interrupts during time-critical sections of code between barriers. Table 1 summarizes the instruction frequencies and execution-time ranges.
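The execution-time ranges quoted above can be collected into a small lookup table. The sketch below is illustrative only: the dictionary name `TIMING`, the helper `sequence_time_range`, and the spelling `Mul` for the multiply instruction are assumptions made here, and the instruction frequencies of Table 1 are not reproduced.

```python
# (minimum, maximum) execution time in time units, as quoted in the text.
TIMING = {
    "Load": (1, 4),
    "Mul": (16, 24),
    "Div": (24, 32),
    "Mod": (24, 32),
    "Or": (1, 1),
    "And": (1, 1),
    "Add": (1, 1),
    "Sub": (1, 1),
    "Store": (1, 1),
}


def sequence_time_range(ops):
    """Best-case and worst-case time for ops executed back to back on one processor."""
    lo = sum(TIMING[op][0] for op in ops)
    hi = sum(TIMING[op][1] for op in ops)
    return lo, hi


# Two loads, a multiply, and a store executed serially.
print(sequence_time_range(["Load", "Load", "Mul", "Store"]))  # (19, 33)
```

The gap between the two bounds (here 14 time units) is the "fuzziness" that accumulates between barriers.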
Benchmark Synthesis
A C program was developed to randomly generate the basic blocks according to the statistics described below. This program requires as input the number of statements, variables, and constants desired in the generated code. It then generates a random sequence of assignment statements satisfying the desired conditions. The frequency of the assignment statements corresponds loosely to the instruction frequency distributions found in [Alexander and Wortman 1975] . Note that in Table 1 the frequencies of load and store are not given. These instructions are provided as necessary during code generation and optimization: The first reference to a variable causes a load for that variable to be generated, and a store is generated when a variable is assigned a value. During code generation the randomly generated assignment statements are optimized using standard local optimizations, including common subexpression elimination, constant folding and value propagation, and dead code elimination [Aho et al. 1986 ]. Hence, the resulting synthetic benchmark does not contain "redundant" parallelism that might skew the results.
An example synthetic benchmark is shown in Figure 1, and its corresponding DAG (directed acyclic graph) is shown in Figure 2. In Figure 1 the leftmost column represents the tuple number. Each tuple is incrementally assigned a number as it is produced by the code generator. Many tuples are not represented because they were removed by the optimizer. The two rightmost columns represent the minimum finish time and maximum finish time assuming unlimited parallelism. These columns will help in ordering the tuples, as will be explained in Section 4. In the instruction DAG the instructions are represented as nodes while edges represent the precedence constraints between instructions. This DAG is important to both code optimization and the scheduling algorithm.
Parameters for both the generated code and the scheduling algorithms can be varied. The machine size for the scheduling algorithm is varied from 2 to 128 processing elements. The parameters for the random sequences of assignment statements include the number of variables and statements. Note that the parallelism width after optimization is within a constant factor (approximately 0.65) of the number of variables. For a fixed number of processors the number of variables can be varied from 2 to 15. For a fixed number of processors and a fixed number of variables, the number of statements can be varied from 5 to 100. The larger basic block sizes approximate a long instruction trace found in VLIW scheduling [Ellis 1985].
Barrier MIMD Principles of Operation
Figures 3 through 8 illustrate the basic scheduling concepts behind barrier MIMD architectures. Figure 3 depicts the use of conventional directed synchronization to ensure that the producer executes before the consumer. Transmitting the synchronization could take a potentially unbounded amount of time dependent on, for example, routing and traffic through a network; hence, the only timing information available at compile time is that the consumer will execute at some time after the producer.
Instead, suppose that compiler analysis attempts to precisely track the minimum and maximum times at which the producer and consumer would execute without using any run-time synchronization. As shown in Figure 4, if the minimum consumer time is greater than the maximum producer time, then no run-time synchronization is required. If not, as in Figure 5, then it is necessary to insert a barrier to impose ordering constraints that will be known to satisfy the producer/consumer relationship. This is shown in Figure 6.
Shaffer [1989] and others have applied a transitive reduction [Aho et al. 1974 ] to task graphs to remove redundant synchronization in code executing on MIMD architectures. Callahan [1987] proposed a similar method for reducing the number of (conventional) barrier synchronizations required in scheduling nested loop constructs. However, these techniques will only remove synchronizations based on graph structure, rather than on the knowledge of minimum and maximum execution time bounds as we propose. In addition to removing synchronizations based on the structure of the task graph, we can safely remove those synchronizations constrained by timing to be satisfied despite the lack of a subsuming synchronization.
There is also a secondary effect unique to our scheduling for the proposed barrier architectures. Typically, there will be a sequence of several producers and consumers in the code being analyzed and scheduled, as shown in Figure 7 . Although at first it seems that each such producer/consumer pair will require a run-time synchronization (hence, two barriers for Figure 7 ), the insertion of a barrier satisfying the first producer/consumer constraint causes the timing of later producer/consumer pairs to be more precisely known. This often (about 28% of the time in our current studies) allows the compiler to avoid inserting further barriers, as shown in Figure 8 .
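The secondary effect can be illustrated with interval arithmetic. The sketch below uses a hypothetical two-processor scenario (not the one in Figures 7 and 8): a barrier inserted for the first producer/consumer pair resets both clocks to zero, after which the second pair is resolved statically. The helper names `advance` and `must_sync` are introduced here for illustration.

```python
def advance(clock, duration):
    """Add a (min, max) duration to a (min, max) clock."""
    return (clock[0] + duration[0], clock[1] + duration[1])


def must_sync(producer_done, consumer_start):
    """Run-time synchronization is needed if the consumer may start too early."""
    return consumer_start[0] < producer_done[1]


BARRIER = (0, 0)  # after a barrier, participants resume in exact synchrony

# Before any barrier: p0 runs a multiply (producer A), p1 runs a load and then needs A.
p0 = advance(BARRIER, (16, 24))  # A is done somewhere in [16, 24]
p1 = advance(BARRIER, (1, 4))    # the consumer of A could start in [1, 4]
first_pair_needs_barrier = must_sync(p0, p1)

# A barrier after A on p0 and before A's consumer on p1 removes all timing uncertainty.
p0 = p1 = BARRIER
p0 = advance(p0, (1, 1))  # producer B (an add) is done at exactly time 1
p1 = advance(p1, (1, 1))  # the consumer of A (an add) is done at exactly time 1
second_pair_needs_barrier = must_sync(p0, p1)

print(first_pair_needs_barrier, second_pair_needs_barrier)
```

In this scenario the first pair requires a barrier and the second does not, because both processors execute only fixed-time instructions after the barrier.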
Models for Barrier Synchronization
We now introduce representations for barrier synchronization in concurrent processors. These representations will help in understanding barrier MIMD execution and design alternatives. In this work, the barrier embedding for a set of concurrent processors will be represented as in Figure 9. The vertical lines represent concurrently executing processors while the horizontal lines represent barriers across the processors they intersect. The semantics of these barriers are that the participating processors cannot proceed until all have reached the barrier.

A binary relation $R$ on a set $X$ is a subset of the Cartesian product $X^2$; that is, $R \subseteq X \times X$. Let $xRy$ correspond to $(x, y) \in R$, and not $(xRy)$ represent $(x, y) \notin R$. The binary relation $<_b$ on a set of barriers $B$ is a partial ordering because $<_b$ is both irreflexive and transitive [Fishburn 1985]. The partially ordered set $(B, <_b)$ may be illustrated by a directed acyclic graph (DAG), with the graph nodes representing barriers and edges representing the ordering relations $<_b$ among the barriers. The initial barrier is defined as the barrier that extends across all processors and precedes all other barriers.

A barrier DAG for the barrier embedding in Figure 9 is shown in Figure 10. The initial barrier for this DAG is $b_0$ (barrier 0). Here we see that $b_2$ (barrier 2) must execute before $b_3$ (barrier 3); hence $b_2 <_b b_3$, and similarly $b_3 <_b b_4$. Transitivity implies $b_2 <_b b_4$. These properties are derived from the barrier semantics: Barrier $b_3$ must be executed after processor $p_3$ has encountered barrier $b_2$. Similarly, $b_4$ must be executed after processor $p_2$ has encountered $b_3$. Several terms that will be used frequently in this paper are now defined.
Total implied synchronizations. The number of edges in the directed acyclic graph (DAG) corresponding to the code generated from a basic block. Each edge is considered to be a producer/consumer synchronization pair.
Barrier synchronization fraction. The number of barriers in the schedule divided by the total implied synchronizations.
Serialized synchronization fraction. The number of synchronizations satisfied by serialization, that is, the number of consumers assigned to the same processor as the corresponding producer, divided by the total implied synchronizations.
Static scheduling fraction. The remaining fraction of total implied synchronizations after the barrier and serialized fractions are removed. It represents the synchronizations that are scheduled away by tracking static timing constraints after a barrier executes, as in the second producer/consumer synchronization of Figure 8 . In this case, no explicit synchronization instruction need be generated.
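The three fractions defined above partition the total implied synchronizations, which a short calculation makes explicit. The function name `sync_fractions` and the sample counts are hypothetical and are not results from the experiments.

```python
def sync_fractions(total_edges, barriers, serialized):
    """Return (barrier, serialized, static) fractions of the implied synchronizations."""
    barrier_frac = barriers / total_edges
    serialized_frac = serialized / total_edges
    # Whatever is neither a barrier nor serialized was resolved by static timing.
    static_frac = (total_edges - barriers - serialized) / total_edges
    return barrier_frac, serialized_frac, static_frac


# Hypothetical schedule: 40 DAG edges, 6 barriers, 14 same-processor producer/consumer pairs.
print(sync_fractions(40, 6, 14))
```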
Barrier MIMD Hardware
Two forms of barrier MIMD are discussed in this paper: static barrier MIMD (SBM) [O'Keefe and Dietz 1990b] and dynamic barrier MIMD (DBM) [O'Keefe and Dietz 1990a].
The difference between the two lies in the run-time ordering of the barriers: The SBM imposes an ordering at compile time, and barriers may be delayed if this compile-time (static) ordering differs from the run-time ordering. The DBM executes the barriers in whatever run-time order they occur. It requires an associative matching memory to achieve this, whereas the SBM requires only a hardware queue. Although, intuitively, a synchronization operation that can span an arbitrary set of processes appears to imply high overhead, we believe that the new barriers can be implemented with lower overhead than conventional binary producer/consumer synchronization. In fact, the PASM prototype [Schwederski et al. 1987 ] has hardware capable of implementing SBM execution, and preliminary benchmarks have demonstrated very good performance [Bronson et al. 1990] . These benchmarks have shown the barrier execution mode to consistently outperform both SIMD and MIMD modes.
The detailed hardware design and performance analysis for hardware barrier synchronization are discussed in separate papers [O'Keefe and Dietz 1990a, 1990b]. We briefly outline the design here. The SBM barrier hardware has a very simple structure closely resembling the enable/disable mask logic of a SIMD control unit. Each barrier is represented by a bit mask indicating which processors participate in that barrier; these bit masks are enqueued into a FIFO queue in the sequence in which they will be executed, as shown in Figure 11. Processors activate a WAIT signal when they execute a wait instruction. The barrier at the head of the queue executes and is removed from the queue when the set of processors waiting for a barrier becomes an improper superset of the processors that are to participate in the head barrier. Processors participating in the executed barrier then proceed past their wait instructions. If a processor not participating in the currently active barrier executes a wait, the wait instruction will not complete until a barrier in which that processor participates becomes active and fires. After a barrier executes, all participating processors resume execution simultaneously (on the next clock tick).
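The queue behavior described above can be modeled in software. The following is a behavioral sketch only, not the hardware design from the cited papers; the class name `StaticBarrierQueue` and its method names are hypothetical, and the sketch assumes that a newly exposed head barrier may fire immediately if all of its participants are already waiting.

```python
from collections import deque


class StaticBarrierQueue:
    """Behavioral model of an SBM barrier queue; bit i of a mask stands for processor i."""

    def __init__(self, masks):
        self.queue = deque(masks)  # barrier masks in compile-time order
        self.waiting = 0           # bit i is set while processor i asserts WAIT

    def wait(self, proc):
        """Processor `proc` executes a wait; return the masks of barriers that fire."""
        self.waiting |= 1 << proc
        fired = []
        # The head barrier fires once the waiting set covers every participant.
        while self.queue and (self.waiting & self.queue[0]) == self.queue[0]:
            mask = self.queue.popleft()
            self.waiting &= ~mask  # participants resume; others keep waiting
            fired.append(mask)
        return fired


q = StaticBarrierQueue([0b0011, 0b1100])
print(q.wait(2))  # [] : processor 2 is not in the head barrier, so it stalls
print(q.wait(3))  # [] : the second barrier is ready but is not at the head
print(q.wait(0))  # [] : the head barrier still lacks processor 1
print(q.wait(1))  # [3, 12] : the head fires, then the next barrier fires
```

The example shows the delay the text attributes to the SBM: processors 2 and 3 are ready first, but their barrier cannot fire until the barrier ahead of it in the static order has executed.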
Code Scheduling and Barrier Insertion Algorithms
The code scheduling algorithm will be described for barrier MIMDs in general. The appropriate restrictions for the static barrier MIMD (SBM) are given at the end of this section. Although minimum and maximum times are known for each of the instructions instead of a fixed execution time, it is straightforward to adapt the scheduling heuristics commonly used for fixed execution-time tasks, and this has been our approach. It is well known that even for simple or relaxed cases the optimal static scheduling of a partially ordered set of tasks on parallel processors is NP-hard and hence computationally intractable [Kasahara and Narita 1984]. However, several heuristics with bounded worst-case performance degradation (from optimal) have been found to be effective for this problem [Hu 1982]. In particular, the critical path method exhibits good performance at reasonable computational cost. Section 4.1 outlines a technique for labeling operations and Section 4.2 applies these labels to generate an ordering for list scheduling. Using the list ordering, Section 4.3 details the assignment of operations to specific processors. Upon assigning each operation to a processor, it may be necessary to insert a barrier; algorithms for this purpose are given in Section 4.4.
The scheduling algorithm proceeds in two phases: ordering of the nodes based on height information, followed by the node assignment to processors, with barrier synchronizations inserted as necessary during node assignment.
Node Labeling
The scheduling algorithm assumes that the instructions are represented in an instruction DAG (directed acyclic graph) $I(N, A)$, where $N$ is the set of $m$ instruction nodes and $A$ is the set of $s$ arcs representing the precedence (producer/consumer) constraints between instructions. An arc directed from node $n_i$ to node $n_j$ is represented as $(n_i, n_j)$. If arc $(n_i, n_j)$ exists, node $n_j$ is said to be a successor (or consumer) of $n_i$; similarly, node $n_i$ is said to be a predecessor (or producer) of $n_j$.

The entry node of the instruction DAG must be executed before all other nodes in the DAG. Similarly, the exit node must be executed after all other nodes in the DAG. The instruction DAG is assumed to have only one entry and one exit node; dummy exit or entry nodes (with zero execution time) can be added as necessary to satisfy this condition. Let $t(n_i)$ represent the execution time for instruction $n_i$, $1 \le i \le m$, which is assumed to take integral values. For variable execution-time instructions, $t_{min}(n_i)$ and $t_{max}(n_i)$ represent the minimum and maximum execution times, respectively, for $n_i$. These times are represented as a tuple $(t_{min}(n_i), t_{max}(n_i))$ associated with $n_i$. For the instruction DAG the critical path is defined as the longest path from the entry node to the exit node. The length of the critical path $t_{cr}$ is given by

$$t_{cr} = \max_k \sum_{n_j \in \pi_k(n_1, n_m)} t(n_j)$$

where $\pi_k(n_1, n_m)$ represents the $k$-th path from the entry node ($n_1$) to the exit node ($n_m$). Clearly, $t_{cr}$ represents a lower bound on the execution time of the instruction DAG $I$, regardless of the number of processors that execute it. The height of $n_i$ is defined as the length of the longest path from the exit node $n_m$ to $n_i$ where the orientation, or direction, of the arcs is reversed, that is,

$$h(n_i) = \max_k \sum_{n_j \in \pi_k(n_m, n_i)} t(n_j)$$

where $\pi_k(n_m, n_i)$ represents the $k$-th path from the exit node to $n_i$.

For the variable execution-time instructions in the DAG the minimum height and maximum height for $n_i$, $h_{min}(n_i)$ and $h_{max}(n_i)$, are defined as follows:

$$h_{min}(n_i) = \max_k \sum_{n_j \in \pi_k(n_m, n_i)} t_{min}(n_j)$$

and

$$h_{max}(n_i) = \max_k \sum_{n_j \in \pi_k(n_m, n_i)} t_{max}(n_j)$$

The minimum (maximum) height corresponds to the height for $n_i$ assuming all nodes take their minimum (maximum) execution time. A height tuple $(h_{min}(n_i), h_{max}(n_i))$ is associated with $n_i$.
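The height labels can be computed with a memoized longest-path recursion over the successors of each node. This is a minimal sketch under the assumption that a node's own execution time is counted in its height; the function name `heights`, the dictionary layout, and the five-node example DAG are hypothetical.

```python
from functools import lru_cache


def heights(succs, times):
    """Return {node: (hmin, hmax)}.

    succs maps each node to its list of successors (consumers);
    times maps each node to its (tmin, tmax) execution time.
    """
    def make(idx):
        @lru_cache(maxsize=None)
        def h(node):
            # Longest remaining path to the exit node, plus this node's own time.
            below = max((h(s) for s in succs.get(node, [])), default=0)
            return times[node][idx] + below
        return h

    hmin, hmax = make(0), make(1)
    return {node: (hmin(node), hmax(node)) for node in times}


# Two loads feed a multiply; one load also feeds an add; both results reach a store.
succs = {"la": ["mul", "add"], "lb": ["mul"], "mul": ["st"], "add": ["st"], "st": []}
times = {"la": (1, 4), "lb": (1, 4), "mul": (16, 24), "add": (1, 1), "st": (1, 1)}
print(heights(succs, times))
```

For this DAG the multiply receives the tuple (17, 25) and each load (18, 29), so the loads that feed the multiply are labeled as the most urgent work.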
Node Ordering
The maximum height and minimum height are computed for all nodes. This can be done in $O(m^2)$ time because the problem reduces to finding longest paths from the exit node to all other nodes [Hu 1982] (with arc orientations reversed). The nodes are first sorted into a list in nonincreasing order using the maximum height as the key, and ties between nodes with equal maximum height are broken using the minimum height (largest first) as the key. The complexity of this sort procedure is no worse than $O(m \log_2 m)$ [Aho et al. 1974].
The maximum height is employed as a key first in an attempt to minimize the worst-case execution time, that is, when all instructions take maximum time. The minimum height is used to break ties because it represents, in some sense, an attempt to optimize for the best case. For example, in Figure 12 for the DAG on the left $h_{max}(n_j) > h_{max}(n_i)$, so $n_j$ is placed ahead of $n_i$ in the list. In the DAG on the right $h_{max}(n_i) = h_{max}(n_j)$, and $h_{min}$ must break the tie. Since $h_{min}(n_j) > h_{min}(n_i)$, $n_j$ is given priority in the list. This intuitively makes sense in that there will be more work to do in $n_j$, on average, than in $n_i$. Other heuristics, such as most immediate successors first [Kasahara and Narita 1984], could be employed to break ties.
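The two-key ordering described above maps directly onto a sort with a composite key. The function name `list_order` and the sample height tuples are hypothetical.

```python
def list_order(height):
    """Sort nodes by maximum height, then minimum height, both largest first.

    height maps each node to its (hmin, hmax) tuple.
    """
    return sorted(height, key=lambda n: (-height[n][1], -height[n][0]))


height = {"a": (18, 29), "b": (20, 29), "c": (17, 25), "d": (2, 2)}
print(list_order(height))  # ['b', 'a', 'c', 'd']
```

Nodes "a" and "b" tie on maximum height (29), so the larger minimum height of "b" places it first.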
Node Assignment
During this phase the nodes are removed (in order) from the sorted list and assigned to the set of processors P. Some nodes are placed in a processor that includes a predecessor (producer) for that node. This serialization of the nodes increases efficiency because it reduces the number of processors required and may eliminate a run-time synchronization operation. On the other hand, too much serialization can increase the schedule length. The node assignment algorithm (see Figure 13 ) attempts to strike a balance between these two competing aims.
In the following description of the node assignment and barrier insertion algorithms, node $n_i$ is scheduled on some processor $p_j \in P$, $1 \le j \le q$, in the set of processors $P$. Barriers $b_k \in B$, $1 \le k \le r$, are added to the set of barriers $B$ as necessary to ensure proper timing between processors. In the algorithm, Preds($n_i$) is the set of all predecessors of $n_i$.
The first step in node assignment is to determine the set of processors in which the predecessors of $n_i$ are scheduled. These processors are placed in set ProdProc. The last instruction scheduled on each processor is examined to determine if it is a member of Preds($n_i$): If no instruction meets this condition, then $n_i$ is assigned to processor $p_j$ with the current minimum finishing time; if only a single predecessor node is scheduled last, on $p_j$, then $n_i$ is assigned to $p_j$; in case several processors satisfy the condition, $n_i$ is scheduled on $p_j$ with the maximum finish time. Ties between processors are broken by giving the lower numbered processor priority.
Figure 13. Node assignment algorithm.
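The processor selection rule can be sketched as follows. This is an illustration of the three cases in the text, not the algorithm of Figure 13 itself; it assumes one scalar finishing time per processor, and the function name `choose_processor` and its argument layout are hypothetical.

```python
def choose_processor(preds, last_node, finish):
    """Pick a processor index for a node whose predecessors are `preds`.

    last_node[p] is the last instruction scheduled on processor p (or None);
    finish[p] is the current finishing time of processor p (assumed scalar).
    """
    # Processors whose most recently scheduled instruction is a predecessor.
    candidates = [p for p, node in enumerate(last_node) if node in preds]
    if not candidates:
        # No predecessor is scheduled last anywhere: take the earliest-finishing processor.
        return min(range(len(finish)), key=lambda p: finish[p])
    # One candidate: use it. Several: take the one that finishes latest.
    # min and max return the first best element, so ties go to the lower index.
    return max(candidates, key=lambda p: finish[p])


last_node = ["a", "x", "b"]
finish = [5, 2, 9]
print(choose_processor({"a", "b"}, last_node, finish))  # 2
print(choose_processor({"x"}, last_node, finish))       # 1
print(choose_processor({"z"}, last_node, finish))       # 1
```

Placing the node after the latest-finishing predecessor serializes that producer/consumer pair, leaving only the earlier-finishing predecessor to be checked for a possible barrier.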
Other schemes for node assignment are possible. In particular, it would be useful to consider satisfying all synchronizations with a single barrier across multiple processors. This might be slightly less efficient for a particular synchronization operation, but providing tighter timing constraints between more processors after the barrier can result in the static resolution of more synchronizations. Such an approach would not be as important in the scheduling experiments in this paper because the instruction DAGs in this study have a maximum of two predecessors per node. In more general DAGs, this change would probably be beneficial.
The time complexity of node assignment is $O(mv)$, where $v$ is the maximum number of predecessor instructions for some node in the DAG. In most cases, $v$ will be small compared to $m$, and the average running time of node assignment should be $O(m)$.
After a node is assigned to a processor, the barrier insertion routine is called to determine if any barriers are required. This algorithm is given in the following section.
Barrier Insertion
In this section, two algorithms for barrier insertion are described. The first algorithm is conservative in that it always adds a barrier synchronization when one is necessary, but it may add unnecessary, redundant barriers. The other barrier insertion algorithm is optimal in the sense that a barrier is not inserted unless it is absolutely necessary at the time of the insertion. The barrier DAG (B, <b) is constructed incrementally as the nodes are assigned to processors and barriers inserted into the schedule. It embodies the precedence constraints among the barriers, as described in Section 3.1.
The notion of one barrier dominating another is useful in constructing the barrier DAG [Aho et al. 1986]. A barrier $b_i$ dominates barrier $b_j$, written $b_i$ dom $b_j$, if every path starting at the initial node of the barrier DAG and ending in $b_j$ includes $b_i$. With this definition the initial barrier dominates all other barriers in the DAG and every barrier dominates itself. This information about a barrier DAG is represented as a dominator tree in which the initial node is the root; a barrier dominates only itself and its descendants in the tree. The existence of the dominator tree follows from the fact that each barrier $b_i$ has a unique immediate dominator that is the last dominator of $b_i$ on any path from the initial barrier to $b_i$ [Aho et al. 1986]. We define the common dominator for two barriers $b_i$ and $b_g$ to be the nearest common ancestor in the dominator tree. Each arc $(b_i, b_j)$ between barriers $b_i$ and $b_j$ in a barrier DAG contains the minimum and maximum execution time for the code between the barriers. Note that the minimum execution time for arc $(b_i, b_j)$ is actually the maximum of the minimum times for all code regions between $b_i$ and $b_j$. For example, in Figure 14 the minimum execution time of the code between barriers $b_c$ and $b_g$ is five, not four; recall that no processor proceeds past the barrier until all have arrived there. This constraint means that even if $p_1$ executes the code between barriers $b_c$ and $b_g$ in four time units, the barrier would still need to wait for $p_0$, which requires five time units. The maximum time for the arc $(b_c, b_g)$ is seven units.
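Both ideas in this paragraph, arc timing and the common dominator, are small computations. In the sketch below the function names `arc_time` and `common_dominator`, the immediate-dominator table, and the per-processor maxima (7 and 6) are hypothetical; only the minima (5 and 4) and the arc maximum of 7 come from the Figure 14 discussion.

```python
def arc_time(regions):
    """Combine per-processor (min, max) region times into one barrier-DAG arc time.

    The next barrier cannot fire before the slowest participant arrives, so both
    bounds are maxima over the participating processors.
    """
    return (max(lo for lo, _ in regions), max(hi for _, hi in regions))


def common_dominator(idom, a, b):
    """Nearest common ancestor of barriers a and b in the dominator tree.

    idom maps each barrier to its immediate dominator; the root maps to None.
    """
    ancestors = set()
    while a is not None:   # every barrier dominates itself, so include a
        ancestors.add(a)
        a = idom[a]
    while b not in ancestors:
        b = idom[b]
    return b


# p0 needs at least 5 units between the barriers, p1 at least 4.
print(arc_time([(5, 7), (4, 6)]))  # (5, 7)

idom = {"b0": None, "bc": "b0", "bg": "bc", "bi": "bc"}
print(common_dominator(idom, "bg", "bi"))  # bc
```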
Conservative barrier insertion.
After $n_i$ has been assigned to a processor, it is necessary to check all producers for $n_i$ to determine if a barrier is necessary. Let $n_g$ be a member of the set Preds($n_i$). Figure 15 gives the conservative barrier insertion algorithm. In the algorithm, function LastBar($n_i$) returns the last barrier $b_i$ to execute before $n_i$; NextBar($n_g$) returns the next barrier $b_n$ to execute after $n_g$. If a path exists between these two barriers in the barrier DAG $(B, <_b)$, this would guarantee that $n_g$ executes before $n_i$; routine FindPath($b_n$, $b_i$) returns true if a path exists. Let $r = \|B\|$, the number of barriers in DAG $B$, and $e$ the number of edges. Note that an upper bound on $r$ is the number of arcs, $s$, in the instruction DAG $I$. A loose upper bound on $e$ is $sq$, where $q = \|P\|$, the number of processors in $P$. Determining if there is a path between two barriers in the barrier DAG requires $O(\max(r, e))$, or $O(\max(s, sq))$, which reduces to $O(sq)$.
If no path exists, timing constraints past the common dominator barrier, the nearest common ancestor in the dominator tree, are computed. Function CommonDominator($b_i$, $b_g$) returns the common dominator $b_c$ for $b_i$ and $b_g$, where $b_g$ is returned by LastBar($n_g$). The common dominator is the last common synchronization point for barriers $b_i$ and $b_g$. Computing the dominator tree for $B$ requires $O(e \log e)$ time, or in terms of $s$ and $q$, $O(sq \log sq)$.
From $b_c$ timing information is propagated to the producer and consumer instructions $n_g$ and $n_i$. For $n_g$ the longest path l_path_cg, assuming maximum execution times for instructions, from $b_c$ to $b_g$ is determined by function LongestPathMax($b_c$, $b_g$, $\phi$), and its length by function Length(l_path_cg). The maximum time necessary to execute all instructions after $b_g$ up to and including $n_g$, delta_g, is added to this length to yield the maximum time max_produce to execute $n_g$ relative to the common dominator barrier $b_c$.
Similarly, the longest path l_path_ci from $b_c$ to $b_i$, assuming minimum execution times for instructions, is found by function LongestPathMin($b_c$, $b_i$, $\phi$); the minimum time to execute all instructions past $b_i$ up to but not including $n_i$ is added to this length, yielding min_consume. These longest path computations are of $O(s^2)$ time complexity; therefore, it follows that node assignment for $m$ nodes has $O(ms^2)$ complexity. If min_consume $\ge$ max_produce, the consumer executes after the producer, and no barrier is needed. Otherwise, a barrier is inserted just before $n_i$ and somewhere after $n_g$. The exact point is determined by computing the maximum time to execute $n_i$; the barrier is inserted after instruction $n_h$, where the time range of $n_h$ overlaps with the maximum time for $n_i$. The new barrier is added to the barrier DAG by procedure add_barrier($n_i$, $n_h$); the dominator tree is computed for the new DAG by procedure update_dominator_tree.
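The conservative timing test can be sketched as a pair of longest-path queries followed by one comparison. This is an illustration rather than the algorithm of Figure 15: the function names, the arc dictionary layout, the maximum of 2 on arc (bg, bi), and the value delta_g = 2 are assumptions chosen to be consistent with the Figure 14 numbers quoted in the text.

```python
def longest_path(arcs, src, dst, idx):
    """Length of the longest src-to-dst path in a barrier DAG, or None if unreachable.

    arcs maps (u, v) to (tmin, tmax); idx 0 uses minimum times, idx 1 maximum times.
    """
    if src == dst:
        return 0
    best = None
    for (u, v), t in arcs.items():
        if u == src:
            rest = longest_path(arcs, v, dst, idx)
            if rest is not None and (best is None or t[idx] + rest > best):
                best = t[idx] + rest
    return best


def barrier_needed_conservative(arcs, bc, bg, bi, delta_g, delta_i):
    """True if the consumer might start before the producer is sure to finish.

    delta_g: maximum time from bg up to and including the producer.
    delta_i: minimum time from bi up to but not including the consumer.
    """
    max_produce = longest_path(arcs, bc, bg, 1) + delta_g
    min_consume = longest_path(arcs, bc, bi, 0) + delta_i
    return min_consume < max_produce


arcs = {("bc", "bg"): (5, 7), ("bg", "bi"): (2, 2)}
# max_produce = 7 + 2 = 9 and min_consume = 5 + 2 + 1 = 8, so a barrier is inserted.
print(barrier_needed_conservative(arcs, "bc", "bg", "bi", delta_g=2, delta_i=1))
```

The recursive path search is exponential in the worst case and is used here only for brevity; it is not the method whose complexity is quoted in the text.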
This conservative barrier insertion algorithm will sometimes add unnecessary barriers. For example, in the barrier embedding given in Figure 14, the conservative insertion algorithm will insert a barrier across processors P_0 and P_2, after the producer node n_g and before the consumer node n_i. In the figure, LastBar(n_g) is b_g and LastBar(n_i) is b_i. The common dominating barrier for b_g and b_i is b_c. Note that T_max(n_g), the maximum time to execute n_g relative to b_c, equals nine. Let n_i^- represent the instruction scheduled immediately before n_i on the same processor; then T_min(n_i^-), the minimum time to execute n_i^-, is eight. Since eight is less than nine, it appears that a barrier is necessary.
However, the longest path from b_c to b_i, λ^1_min(b_c, b_i), overlaps with the longest path from b_c to b_g, λ^1_max(b_c, b_g), on arc (b_c, b_g); recall that different assumptions have been made about the execution time for this arc on the different paths. If this is taken into account, then λ^1_min(b_c, b_i) should be computed, as before, assuming minimum execution times for arcs, except for the arcs that intersect λ^1_max(b_c, b_g). For these arcs the maximum execution time should be used when computing the longest path. In Figure 14, this means that arc (b_c, b_g) has value 7, arc (b_g, b_i) has value 2, and the minimum time for n_i^- is one, yielding an actual minimum start time for node n_i of ten. Thus, n_i always executes after n_g and no barrier is required. In the next section an optimal barrier insertion algorithm that does not generate these unnecessary barriers is described.
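Restating the Figure 14 arithmetic with only the values quoted in the text, the conservative test compares

$$ T_{\min}(n_i^-) = 8 \;<\; T_{\max}(n_g) = 9 , $$

which calls for a barrier. Charging the shared arc at its maximum time instead gives a minimum start time for the consumer of

$$ 7 + 2 + 1 = 10 \;\ge\; T_{\max}(n_g) = 9 , $$

so the producer is guaranteed to finish first.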
Optimal barrier insertion.
From the previous example, it is clear that the problem with the conservative insertion algorithm is that it does not take into account the possibility that the longest paths from the common dominator to the producer and consumer nodes may overlap. In such cases, assuming maximum execution times on arcs that overlap may increase the minimum execution time for the consumer node enough to resolve the synchronization statically.
However, resolving the synchronization statically is not quite that simple. The "second" longest path (to the producer node) must also be checked: If the execution time on this path is also greater than the consumer node minimum execution time, then the same check for path overlap and resulting timing adjustments must be made. This process continues for the decreasing longest paths to the producer node until the length of the k-th longest path to the producer is less than the longest path to the consumer node (assuming minimum execution times) or it is found that a barrier must be inserted.
Let b_c be the nearest common dominator for barriers b_g and b_i, where b_g is LastBar(n_g) and b_i is LastBar(n_i). Recall that n_i is being scheduled, n_g is a producer for n_i, and it is necessary to determine if a barrier is required between these instructions.
The relationship between the various path lengths can be expressed as

$$ l(\lambda^{1}_{\max}(b_c, b_g)) \ge l(\lambda^{2}_{\max}(b_c, b_g)) \ge \cdots \ge l(\lambda^{k}_{\max}(b_c, b_g)) $$

where λ^j_max(b_c, b_g), 2 ≤ j ≤ k, represents the j-th longest path (assuming maximum execution times) from b_c to b_g. If the timing condition for the j-th longest path, with the consumer's minimum time recomputed using maximum execution times on the arcs shared with that path, is satisfied, consider the next longest path λ^{j+1}_max(b_c, b_g) and repeat the process. If the condition is not met, then a barrier must be inserted as described in the conservative barrier insertion algorithm, and the scheduling algorithm starts again with the next node in the list. This process of successively checking the j-th longest paths continues until

$$ l(\lambda^{j}_{\max}(b_c, b_g)) + \delta_{\max}(n_g) \le l(\lambda^{1}_{\min}(b_c, b_i)) + \delta_{\min}(n_i^-) $$

is met for the k-th longest path; that is, j = k, proving that the synchronization is satisfied statically and no barrier is required.
To modify the conservative barrier insertion algorithm (Figure 15) to implement these tests, the following code from the algorithm

    if min_consume ≥ max_produce then
        return; /* static time constraints resolve the synchronization */
    else /* insert a barrier before n_i and after n_g */

is replaced by

    if min_consume ≥ max_produce then
        return; /* static time constraints resolve the synchronization */
    else if CheckPathOverlap(l_path_cg, b_i, b_g, b_c, diff, threshold) = TRUE then
        return; /* path overlap shows sync resolved statically */
    else /* insert a barrier before n_i and after n_g */

Function CheckPathOverlap in the modified algorithm checks for path overlap on successive k-th longest paths. CheckPathOverlap is given in Figure 16. Note that the third parameter in function LongestPathMin is the path l_path_cg; the function returns the longest path assuming minimum time for edges except along path l_path_cg, where maximum times are used. Function KthLongestPaths(b_c, b_g, threshold) returns a sorted list of the k-th longest paths between b_c and b_g with length greater than or equal to threshold.

[Figure 16. Algorithm for checking path overlap in a barrier DAG: function CheckPathOverlap(l_path_cg, b_i, b_g, b_c, diff, threshold), which finds the longest path from b_c to b_i assuming maximum time for arcs on l_path_cg.]

These paths can be computed using an algorithm originally developed by Hoffman and Pavley [1959] to find the k-th shortest paths in a graph. This algorithm was refined and improved by Dreyfus [1969] and Fox [1973]. The basic approach is to find the set of deviations from the shortest path, where a deviation is defined as a path that coincides with the shortest path from its origin up to some node j on the path, then deviates directly to some node k that is not the next node on the shortest path, and finally proceeds from k to the fixed terminal node via the shortest path from k. These algorithms are designed to find the k-th shortest paths, but modifying them for longest paths is straightforward. Fox [1973] found the time complexity of the problem to be N^2 log N + 2kN log N in an N-node graph, or O(s^2 log s + ks log s) in terms of the barrier DAG with O(s) barriers. Thus, node scheduling and assignment using optimal barrier insertion has worst-case time complexity O(m(s^2 log s + ks log s)), compared to O(ms^2) for the conservative algorithm. Note that the average running times should be smaller. For example, the dominator computation could presumably reuse information when updating the dominator tree after a barrier is inserted. Also, some longest path information may be reusable after a new barrier is added to the DAG.
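The successive path checks can be sketched in a few lines. This is not the Figure 16 algorithm: it swaps the k-th longest path computation for brute-force enumeration of every path, which is only practical for tiny DAGs, and the function names, data layout, and timings are all assumptions.

```python
def all_paths(arcs, src, dst):
    """Enumerate every src->dst path in a small DAG as a list of arcs."""
    if src == dst:
        return [[]]
    paths = []
    for (u, v) in arcs:
        if u == src:
            paths += [[(u, v)] + rest for rest in all_paths(arcs, v, dst)]
    return paths

def check_path_overlap(arcs, bc, bg, bi, delta_g_max, delta_i_min):
    """True if the synchronization is resolved statically.

    arcs maps (u, v) to (t_min, t_max). Assumes bg and bi are both
    reachable from the common dominator bc.
    """
    consumer_paths = all_paths(arcs, bc, bi)
    plain_min = max(sum(arcs[a][0] for a in p)
                    for p in consumer_paths) + delta_i_min
    producer_paths = sorted(all_paths(arcs, bc, bg),
                            key=lambda p: sum(arcs[a][1] for a in p),
                            reverse=True)
    for path in producer_paths:          # longest producer path first
        produce = sum(arcs[a][1] for a in path) + delta_g_max
        if produce <= plain_min:
            return True                  # remaining paths are no longer
        shared = set(path)
        # Consumer minimum with maximum times on arcs shared with path.
        consume = max(sum(arcs[a][1] if a in shared else arcs[a][0]
                          for a in p)
                      for p in consumer_paths) + delta_i_min
        if consume < produce:
            return False                 # a barrier must be inserted
    return True

# Hypothetical timings: the shared arc (bc, bg) takes 5..7.
overlap = {("bc", "bg"): (5, 7), ("bg", "bi"): (2, 2)}
disjoint = {("bc", "bg"): (5, 7), ("bc", "bi"): (3, 4)}
print(check_path_overlap(overlap, "bc", "bg", "bi", 2, 1))
print(check_path_overlap(disjoint, "bc", "bg", "bi", 2, 1))
```

In the first graph the producer and consumer paths share arc (bc, bg), so the adjusted consumer time (10) exceeds the producer time (9); in the second the paths are disjoint and a barrier is still needed.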
Barrier merging.
An additional merging step is performed when inserting barriers into an SBM schedule. If the execution-time range of the new barrier overlaps with that of any other barrier currently scheduled, and if the overlapping barriers are not ordered with respect to the barrier DAG, then they are merged into a single barrier. For one set of benchmarks studied (with 10 variables and 80 statements) the merging of barriers yielded 35% fewer barriers in the resulting schedules. The static scheduling fraction also increased as a result of the larger barriers in the SBM schedule. The merging of barriers increased the completion time for the SBM compared to the DBM, although these times are quite close for the benchmarks studied. More scheduling results are given in the next section.
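A minimal sketch of the merging rule follows. It assumes each barrier records the processors it spans and a closed (earliest, latest) execution-time range, and that the merged barrier spans the union of the processor sets; the names, data layout, and values are hypothetical.

```python
def overlaps(r1, r2):
    """True if two closed time ranges (t_min, t_max) intersect."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def merge_new_barrier(new_name, barriers, ordered):
    """Processor set of the new barrier after merging.

    barriers maps a name to {"procs": set, "range": (t_min, t_max)}.
    ordered(a, b) is True when a path joins a and b (in either
    direction) in the barrier DAG.
    """
    new = barriers[new_name]
    procs = set(new["procs"])
    for name, bar in barriers.items():
        if name == new_name:
            continue
        if overlaps(new["range"], bar["range"]) and not ordered(new_name, name):
            procs |= bar["procs"]   # unordered, overlapping: merge
    return procs

# Hypothetical schedule: b1 and b2 overlap in time, b3 executes later.
barriers = {
    "b1": {"procs": {0, 1}, "range": (4, 6)},
    "b2": {"procs": {2, 3}, "range": (5, 8)},
    "b3": {"procs": {4, 5}, "range": (9, 12)},
}
unordered = lambda a, b: False      # assume no DAG ordering at all
print(merge_new_barrier("b2", barriers, unordered))
```

Here the new barrier b2 absorbs b1 but not b3, whose time range does not overlap.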
Scheduling Experiments
The scheduling algorithms discussed in the last section were applied to the synthetic benchmark programs. The effects of varying different parameters that are related to the architecture of the machine and the structure of the synthetic benchmarks have been studied. Architecture parameters that were varied include the number of processors and the timing assigned to each instruction; barriers were assumed to always execute immediately upon arrival of the last participating processor. Benchmark parameters included the number of instructions and variables in generated programs. Particular attention has been paid to the different synchronization fractions and how they vary as the parameters change. These results have provided good feedback concerning the performance of the scheduling algorithms. One hundred synthetic benchmarks were generated for each set of parameters and the results averaged to yield each point on the curves shown in this section. Thousands of benchmarks were generated for between 2 and 128 processors with 2 to 15 variables and 5 to 100 assignments. Some benchmarks resulted in code with unrealistically few synchronizations; hence, we excluded these cases from our analysis. For the remaining programs the results were within the following ranges:
• The barrier fraction varies from 0.9% to 25%.
• The serialization fraction varies from 39% to 96%.
• The fraction of barriers statically scheduled away varies from 2% to 50%.
Note that the last fraction represents a feature unique to barrier architectures. A scatter plot with the serialization fraction on the vertical axis and the statically scheduled fraction on the horizontal axis is shown in Figure 17. The results for more than two thousand of the synthetic benchmarks are given in the figure, and it can be seen that, on average, about 88% of the synchronizations for the benchmarks in the plot are either serialized or statically scheduled away.
The following few sections examine the impact of various program and machine characteristics on the performance of the timing analysis and barrier scheduling. Additional relationships are discussed in [Zaafrani 1990].
Number of Assignment Statements
This section investigates the effect of changing the number of instructions in the synthetic benchmarks. The number of processors and variables is fixed (8 processors and 15 variables), while the number of instructions varies.
As can be seen in Figure 18, the fraction of barrier synchronization decreases as the number of statements in the generated basic blocks increases. This decrease is relatively large when the number of statements varies from 5 to 20. Generally speaking, the Load operations are scheduled for early execution. Since Load is a variable execution-time instruction (from one to four time units), barriers are often required after a Load. Hence, there is a concentration of barriers at the beginning of the execution of the basic block. The barrier fraction decreases when we increase the number of statements because the percentage of Loads becomes smaller. This effect is counterbalanced at larger benchmark sizes because Mul, Div, and Mod instructions begin appearing in generated benchmarks, and these instructions have large execution-time variation.
Notice that the serialization fraction decreases as the benchmark size increases. Larger benchmarks make it less likely that a consumer can be placed directly after a producer because another instruction has probably been scheduled there already. 
Number of Variables
In this section the benchmark size and the number of processors are fixed (60 statements and 8 processors), while the number of variables is changed. Referring to Figure 19, as the number of variables increases, the barrier fraction first increases, then remains constant. Since the parallelism width is within a constant factor (~0.65) of the number of variables, the parallelism width increases with the number of variables. Hence, the scheduling algorithm employs more processors in the schedule as the number of variables increases. The result is that more barriers are generated until the parallelism width of the benchmark exceeds the number of processors, after which the barrier fraction becomes nearly constant.
The serialization fraction decreases when more variables are used because for a large number of variables, the parallelism width is large, and fewer opportunities for serialization exist.
Number of Processors
In this section the benchmark size and the number of variables are fixed, while the number of processors is varied. For two variables, increasing the number of processors did not affect the barrier fraction because the scheduling algorithm keeps almost all of the instructions in two processors. The serialization and static scheduling fractions also remain constant.
For five variables, when the number of processors is increased from two to four, the barrier fraction increases because more synchronization is required between the different processors. The barrier fraction stabilizes after four processors because the scheduling algorithm employs no more than four processors. For N variables the barrier fraction increases while the number of processors employed is less than the parallelism width. When the number of processors used is greater than the parallelism width, the barrier fraction remains constant.
Figure 20 illustrates the effect of increasing the number of processors on the different synchronization fractions for a barrier machine. This figure is for 100-assignment statement basic blocks with 10 variables. As can be seen in the figure, the serialization fraction remained nearly constant as the number of processors increased. There are two effects contributing to this serialization rate behavior that tend to cancel each other out, resulting in a fixed serialization fraction. The first effect is that for small numbers of processors, the scheduling algorithm often inadvertently serializes a synchronization operation. Conversely, the scheduling algorithm has fewer opportunities to serialize for a barrier machine with a small number of processors, because quite often the "serialization slot" will already be filled. The first effect results in a decrease in the serialization fraction for larger numbers of processors, while the second effect yields an increase in the serialization fraction.
Analysis of the Heuristics
The heuristics employed in code scheduling were analyzed by first modifying them in some way and then scheduling the same synthetic benchmarks using the original and new version of the heuristic. Various characteristics of the resulting schedules, including execution time and synchronization fractions, were compared to gain insights into the performance of each heuristic.
The node assignment step was modified to perform simple round-robin scheduling, where the i-th node in the sorted list is scheduled on processor (i modulo N), where N is the number of processors. As expected, the serialization fraction nearly vanished for large numbers of processors; the barrier fraction also increased significantly, in some cases reaching 50%. Both the minimum and maximum execution times increased, although the execution-time difference between list scheduling and pure round-robin became smaller for larger numbers of processors. These results suggest that the node assignment heuristics employed in the scheduling algorithm were reasonable.
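The round-robin variant is simple enough to sketch directly; the function name and the node labels below are illustrative assumptions.

```python
def round_robin_assign(sorted_nodes, num_procs):
    """Place the i-th node of the sorted list on processor i mod N."""
    schedule = [[] for _ in range(num_procs)]
    for i, node in enumerate(sorted_nodes):
        schedule[i % num_procs].append(node)
    return schedule

# Five hypothetical instruction nodes on two processors.
print(round_robin_assign(["n0", "n1", "n2", "n3", "n4"], 2))
```

Because placement ignores producer/consumer relationships, a consumer rarely lands directly after its producer on the same processor, which is consistent with the drop in serialization reported above.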
Another change to the node assignment heuristic switched the ordering priority: The minimum height h_min was employed first in the sort of the instruction nodes; the maximum height was then employed to break ties. As expected, the minimum execution time of the benchmarks decreased while the maximum time increased. This is not surprising because the new ordering, in some sense, attempts to optimize the minimum execution time. However, the changes were quite small.
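The two orderings differ only in the sort key. The sketch below assumes nodes with greater height are scheduled first (the sort direction is an assumption here) and uses made-up heights.

```python
# Hypothetical nodes mapped to (h_min, h_max) heights.
heights = {"a": (3, 7), "b": (4, 6), "c": (3, 9)}

# Original priority: maximum height first, minimum height breaks ties.
original = sorted(heights, key=lambda n: (heights[n][1], heights[n][0]),
                  reverse=True)

# Modified priority: minimum height first, maximum height breaks ties.
swapped = sorted(heights, key=lambda n: (heights[n][0], heights[n][1]),
                 reverse=True)

print(original)
print(swapped)
```

With these heights the original key orders the nodes c, a, b, while the swapped key orders them b, c, a.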
An attempt was made to increase the serialization rate by performing lookahead on the sorted instruction list. Instructions in a window of size p on the top of the list were examined before a node was assigned to a processor to see if such an assignment would preclude a later serialization assignment. The effects were interesting: The serialization fraction increased as expected, but not very much for large numbers of processors, because the original scheduling algorithm without lookahead keeps the inherently serial instruction streams on a single processor. For small numbers of processors the execution time increased by 10% to 30% due to an increase in the critical path length brought on by the additional serializations. This increase disappears for a large number of processors.
Another parameter of interest is the timing variation of the instructions. Synthetic programs were generated with instructions having very large timing variations. The results showed that the barrier synchronization fraction was not very sensitive to increases in instruction timing variation, increasing only slightly for large variations.
In the VLIW execution mode all instructions were assumed to require their maximum time to execute. No asynchrony was allowed in VLIW execution. Execution time, as displayed in Figure 21, has been normalized to VLIW execution time. As can be seen in the figure, the maximum times for both the barrier MIMD and VLIW were nearly identical. The barrier machine took slightly longer to complete execution for smaller numbers of processors because more barriers were required due to instruction timing variation.
The minimum barrier MIMD completion time was about 25% lower than the VLIW completion time, and the average barrier completion time will fall between the minimum and maximum times, the exact value being determined by the probability distributions of the variable execution-time instructions. It should be noted that an optimal schedule (completion time equal to the critical path time) was determined for almost all the synthetic benchmarks for the comparison.
These results suggest that it is important to reduce the maximum execution time for instructions, and this is traditionally done in VLIWs by adding extra hardware for such operations as integer and floating point multiplies. Also, memory reference delays are reduced by having multiple paths to memory banks and through careful scheduling of memory references to avoid bank conflicts [Colwell et al. 1988].
Conclusions
VLIW architectures have proven to be capable of providing consistently good performance over a larger range of programs than vector processors; the barrier MIMD architectures hold great promise for extending this range even further. Whereas VLIW architectures cannot achieve efficient parallel execution of while loops, subroutine calls, and variable execution-time instructions, barrier MIMDs provide a hardware mechanism that allows VLIW-like static scheduling to be applied to all these constructs.
Barrier MIMD hardware has been shown to be easily and efficiently implementable Dietz 1990a, 1990b] ; hence, the prime question is whether a reasonable approximate solution to the NP-complete problem of static code scheduling for such machines can be found. In this paper we have presented a set of algorithms that efficiently implements this code scheduling. Further, we have shown the proposed algorithms to perform very well, citing the results from applying these algorithms to over thirty-five hundred synthetic benchmark programs. Note that we are not claiming huge performance increases on VLIW code, but rather the ability to achieve VLIW-like performance for a much wider range of program constructs.
Ongoing work includes the extension of the basic scheduling techniques to more complex code structures (including arbitrary control flow) and the application of the barrier scheduling techniques to remove some synchronizations in conventional MIMD architectures ]. We are also working toward a complete machine design, the CARP (Compiler-oriented Architecture Research at Purdue) machine [Dietz, Siegel et al. 1989], that incorporates the barrier MIMD mechanism as well as various other novel compiler/architecture interactions.
