Abstract
The performance of a VLIW machine depends strongly on the ability of the compiler to exploit instruction level parallelism (ILP) in programs. Unfortunately, the task of the compiler is made difficult by the presence of a number of intractable optimization problems such as instruction scheduling and register allocation. In this paper, we study one such problem: scheduling with profile information. We believe that our results will serve as a good starting point for developing practical heuristics that are to be included in compilers for VLIW machines. As evidence, we present experimental results validating heuristics suggested by our analysis.
A basic block is a program fragment that may only be entered at the top and exited at the bottom. The precedence graph of a single basic block will be a directed acyclic graph (DAG) [2], and in practice, typically consists of fewer than 10 vertices. Scheduling small basic blocks consecutively and separately leads to underutilization of the functional units due to sequentialization effects at the block boundaries. To overcome this limitation two broad approaches have been proposed. One, called if-conversion, eliminates the branches via hardware support for predicated execution, allowing instructions to be moved outside of their basic blocks; see [4] for instance. The other approach does not require hardware support, and involves the formation of larger code regions such as traces [7] and super blocks [14]. A super block consists of a sequence of basic blocks strung together, with conditional exits at the branch points that separate the basic blocks. Super blocks are typically formed as follows. Given is a code region with branch probabilities available at each branch in the region. These probabilities can be obtained by profiling, that is, by collecting usage statistics when the code is executed. Starting at the entry point to the code region, follow the dominant fork at each branch. If at a certain branch, the probability of reaching the branch from the start falls below a certain threshold level, terminate the path. Alternatively, if neither fork in the branch is predominant, terminate the path. The chain of basic blocks along the path traced in this manner is a super block. Delete the chain of basic blocks defining the super block from the code region and repeat the process on the modified portion. Since some of the basic blocks that are deleted might have entries into them from other blocks in the code region, these blocks must be duplicated in the modified region. This process is called "tail duplication."
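The path-tracing step described above can be sketched as follows. This is a minimal illustration only: the control-flow graph encoding, the function name `trace_super_block`, and the two threshold parameters `min_reach` and `min_dominance` are assumptions made for the example, and tail duplication is not modeled.

```python
def trace_super_block(cfg, entry, min_reach=0.1, min_dominance=0.6):
    """Return the chain of basic blocks found by following dominant forks.

    cfg maps a block name to a list of (successor, branch_probability) pairs.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    chain = [entry]
    reach = 1.0  # probability of reaching the current block from the entry
    block = entry
    while cfg.get(block):
        succ, prob = max(cfg[block], key=lambda edge: edge[1])
        if prob < min_dominance:   # neither fork is predominant
            break
        reach *= prob
        if reach < min_reach or succ in chain:  # too unlikely, or a back edge
            break
        chain.append(succ)
        block = succ
    return chain


cfg = {
    "A": [("B", 0.9), ("X", 0.1)],
    "B": [("C", 0.8), ("Y", 0.2)],
    "C": [("D", 0.5), ("E", 0.5)],
}
print(trace_super_block(cfg, "A"))  # ['A', 'B', 'C']
```

In this toy region the trace stops at block C because its two forks are balanced, which is the second termination rule in the text.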
It is clear that the amount of tail duplication will affect the size of the overall code, thereby affecting performance in the face of fixed instruction cache sizes. In light of this, the threshold and the parameters defining the notion of predominance are empirically determined and are outside the scope of this paper.
Much of the literature on scheduling ignores profile information, under the assumption that the region formation algorithm has fully digested the profile information. For instance, super blocks are typically scheduled with heuristics that are oblivious to the profile information, although there are techniques that use profile information as an addendum to classical scheduling techniques, e.g., the speculative yield technique of [3] , following [7] . (Example 1 examines several of these techniques.) As region formation algorithms become more sophisticated and produce non-linear code regions encompassing balanced branches, they will be less effective in digesting the profile information. As a result, it will be increasingly important that the code-region be annotated with profile information, and that the scheduling technique effectively utilize this profile information. We believe that our paper offers a first step in this direction.
One might argue that it would be better to tackle the problem of profile driven scheduling from first principles, by treating general code regions, rather than processed code regions such as super blocks. However, there are two good reasons for restricting our study to processed regions. (1) as we will see shortly, profile-driven scheduling of generic code regions is computationally intractable and is impractical unless the number of branches in the region is small, say fewer than 16; and, (2) tail duplication during the formation of simpler regions such as super blocks often exposes more instruction level parallelism than is extant in the generic code region. Thus, it is a legitimate goal to study good scheduling heuristics even for the limited case of super blocks.
We now make precise our abstraction of the general problem of scheduling with profile information. We are given a directed acyclic precedence graph derived from the source program as described above. Each vertex in the graph represents an operation $r_i$ with specified execution time $t_i$, which is the time required to execute $r_i$. Each vertex also carries a weight $w_i$. The weight $w_i$ is the probability that the program exits at vertex $i$. In other words, $w_i$ is the probability that only the portion of the precedence graph rooted at vertex $i$ needs to be computed, i.e., only those vertices upon which vertex $i$ depends need to be computed. We assume that the target machine has $m$ functional units. We are to schedule the vertices of the graph on the $m$ functional units to achieve the lowest cost, i.e., shortest weighted execution time. Specifically, we are to find a schedule to minimize $\sum_i w_i f_i$, where $f_i = s_i + t_i$ is the finish time of operation $r_i$ and $s_i$ is its start time.
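As a small illustration of the cost function just defined, the sketch below evaluates the weighted finish time of a given schedule. The function name and the dictionary-based encoding of start times, latencies and weights are assumptions made for this example.

```python
def weighted_finish_time(start, latency, weight):
    """Cost of a schedule: sum over operations of w_i * (s_i + t_i)."""
    return sum(weight[v] * (start[v] + latency[v]) for v in start)


# Three unit-latency operations issued back to back on one functional unit.
start = {"a": 0, "b": 1, "c": 2}
latency = {"a": 1, "b": 1, "c": 1}
weight = {"a": 0.1, "b": 0.3, "c": 0.6}
cost = weighted_finish_time(start, latency, weight)  # 0.1*1 + 0.3*2 + 0.6*3
```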
The general problem in which every node has a weight has been shown to be NP-hard even for $m = 1$ provided we permit arbitrary precedence constraints on the operations [9, 15]. The problem is polynomially solvable when the precedence graph is a forest [12] or a generalized series-parallel graph [1, 15]. For $m > 1$, the problem is NP-hard even without precedence constraints, unless the weights are all identical, in which case it is polynomially solvable; on the other hand, the problem is strongly NP-hard even when all weights are identical and the precedence graph is a collection of chains [5]. In light of the intractable nature of the problem, we adopt the standard approach of designing approximation algorithms with a bounded performance ratio. The performance ratio of an approximation algorithm is defined as the worst-case ratio of the cost of the approximate solution to that of the optimal solution.
We begin with a general lemma that shows how to construct an optimal sequential schedule for a general precedence graph with weights. The construction of the lemma can be efficiently exploited only for two restricted versions of the problem: the case where the precedence graph is a tree, i.e., each vertex has exactly one outgoing edge, and the S-graph case where the weights are non-zero only on a single path. The S-graph, to be defined formally in the next section, is the abstraction of the super block. We then show that using the optimal sequential schedule as a list to drive a list scheduling algorithm for multiple functional units guarantees a performance ratio of 2. Finally, in a heuristic extension to our basic lemma, we present a generic scheme for converting any list scheduling algorithm that is insensitive to profile information into a scheduling algorithm for super blocks that is sensitive to profile information. We cannot show tight performance guarantees on this heuristic, but we do present experimental results on a number of sample super blocks obtained by applying the Impact compiler to SPEC benchmark programs. We report that significant savings are possible as compared to prior methods.
Example 1 Consider the precedence graph of Figure 1, which is the precedence graph of a super block with three branch vertices, vertices 3, 7, and 28 as marked in the figure. The probability that the program will exit via vertex 3 is 0.1. Similarly, the probabilities that it will exit via vertices 7 and 28 are 0.3 and 0.6 respectively. We assume we are given a non-pipelined processor with two identical functional units with all operations having a latency of one cycle. We now schedule this graph on the processor in three ways. First, we carry out critical path scheduling. This is equivalent to

Figure 1. Precedence graph of a super block. Each vertex corresponds to a unit latency operation. The probability labels on the branch exits of the graph are the probabilities that the exits will be taken.

Theoretical Results
Let G = (V, E ) denote the precedence graph. A sink in the graph is a vertex with no outgoing edges. We assume, without loss of generality, that the graph has exactly one sink, since we can easily ensure this by the addition of a dummy vertex with in-edges from the sinks of the given graph. Each vertex i is assigned a weight wi. Let P be a path from a source to a sink in a precedence graph G.
We now define the notion of an S-graph, the graph-theoretic abstraction of a super block. Recall that a super block consists of a chain of basic blocks with conditional exits at the branch points separating the basic blocks. The graph G is said to be an S-graph with respect to P if the weights wj are zero everywhere except on the path P. Without loss of generality, we assume that the weight on the sink is nonzero. If not, we can delete the sink and break the graph into a number of components, retaining only the component containing P . The precedence graph of a super block will contain precedence edges between the branch vertices, since these cannot be executed out of order. Since the branch vertices are the only vertices with non-zero exit probabilities, the precedence graph of a super block is an S-graph.
We say that $u$ immediately precedes $v$ if and only if there is an edge from $u$ to $v$ in the graph. A vertex $u$ precedes a vertex $v$ if and only if there is a path from $u$ to $v$. For any vertex $u \in V$, let $G_u$ denote the subgraph of $G$ induced by the set of vertices preceding $u$. A subgraph is said to be closed under precedence if for every vertex $u$ in the subgraph, all vertices preceding $u$ are also in the subgraph.
We define the rank of a vertex $i$ to be the ratio $r_i = t_i / w_i$.
Although a vertex can have infinite rank, as will become evident shortly, we will only be interested in those of finite rank. For any set of vertices $A \subseteq V$, we define its weight as $w(A) = \sum_{v_i \in A} w_i$, and its execution time as $t(A) = \sum_{v_i \in A} t_i$; based on this, the rank is $r(A) = t(A) / w(A)$.
For instance, the rank of the set of vertices preceding vertex 7 in Figure 1 is 7/.4 = 17.5. The notion of the rank of a set of vertices is meant to capture their relative importance, comparing the sum of their weights to the cost of executing them. Intuitively, the sum of the weights is the contribution made by the set of vertices to the weighted finish time, while the sum of their execution times is the delay suffered by the rest of the graph as a result of scheduling the set of vertices first. As will become evident in our basic lemma, the notion of rank plays a key role in characterizing the optimal sequential schedule.
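The rank of a vertex set can be computed directly from its definition, as in the sketch below. The vertex numbering is hypothetical: it only mimics the arithmetic of the instance above (seven unit-latency operations carrying exit probabilities 0.1 and 0.3) and is not the actual membership of the subgraph in Figure 1.

```python
import math


def rank(vertices, latency, weight):
    """r(A) = t(A) / w(A); infinite when the set carries no exit probability."""
    total_time = sum(latency[v] for v in vertices)
    total_weight = sum(weight.get(v, 0.0) for v in vertices)
    return math.inf if total_weight == 0 else total_time / total_weight


# Hypothetical stand-in for the set of vertices preceding vertex 7.
latency = {v: 1 for v in range(1, 8)}
weight = {3: 0.1, 7: 0.3}
r = rank(range(1, 8), latency, weight)  # 7 / 0.4, about 17.5
```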
The Basic Lemma
In this section we develop a basic lemma characterizing optimal sequential schedules of weighted precedence graphs, i.e., schedules on a single functional unit. Efficient algorithms for the two cases where the precedence graph is a tree, and where the precedence graph is an S-graph, can be obtained as special applications of the basic lemma. Previously, optimal sequential algorithms for weighted trees have been described in the literature [1, 8, 12]. Applying the basic lemma to general weighted graphs would cost time exponential in the number of vertices with non-zero weights, a cost that can be practical if the number of such vertices, i.e., the number of branches, is small. From a theoretical point of view, the best known approximation algorithm [11] for sequential scheduling of weighted DAGs has a performance guarantee of 2; however, this algorithm is based on rounding solutions of linear programs and is impractical for the compiler setting.
The following terms are defined with respect to a specific schedule $S$. We use the term segment to refer to a set of consecutive operations in a schedule. Two segments $B_1$ and $B_2$ in the schedule are independent if there are no operations $u \in B_1$ and $v \in B_2$ such that $u$ precedes $v$ or $v$ precedes $u$. Given a weighted precedence graph $G$, we define $G^*$ to be the smallest precedence-closed proper subgraph of $G$ of minimum rank.
We now prove our main lemma.
Lemma: For any graph $G$, there exists an optimal sequential schedule where the optimal schedule for $G^*$ occurs as a segment which starts at time zero.
Proof: Let $S$ be an optimal schedule for $G$ in which $G^*$ is decomposed into a minimum number of maximal segments. Suppose that $G^*$ is decomposed into two or more segments in $S$. Let $S'$ be the schedule formed from $S$ by moving all the $B_i$'s ahead of the $C_i$'s while preserving their order within themselves. The schedule $S'$ is legal since $G^*$ is precedence closed. We will show that the cost of $S'$ is no more than that of $S$, which will finish the proof. While comparing the costs of the two schedules, we can ignore the contribution of the vertices that come after $B_k$ since their status remains the same in $S'$. For the schedule $S'$ we have Q and $B_j$ is smaller than $G^*$.

Cost(S') = W(C_i) t(B_k) + W(C_i) t(C_i) +

Taking their difference gives,

The second inequality above follows from our earlier observations about r(d) and r(B - B_j). The third step follows from a simple reordering of the order of summation.

Schedules

The main lemma essentially reduces the scheduling problem to the problem of finding $G^*$. Then, we can recursively schedule $G^*$ and the graph formed by removing $G^*$, and put their schedules together to obtain an optimal schedule for the entire graph. Unfortunately, the problem of finding $G^*$ for an arbitrary precedence graph is NP-hard. However, if the number of vertices in the graph that have non-zero weight is small, then $G^*$ can be feasibly determined by exhaustive enumeration. Next, we show that finding $G^*$, and hence finding optimal sequential schedules, is relatively straightforward if the precedence graph is an S-graph. Let $G$ be an S-graph with respect to a path $P$. If $G^*$ has a sink not on path $P$ or a sink of zero weight, such a sink can be deleted and both the rank and the number of vertices in $G^*$ reduced appropriately. Thus $G^*$ must be a subgraph with a single sink, and the sink must be a vertex on $P$ with non-zero weight, so determining $G^*$ is straightforward. The schedule so obtained is essentially the one obtained by greedily scheduling successive vertices on the path defining the S-graph as early as possible. In terms of the corresponding super block, this amounts to scheduling the basic blocks comprising the super block in control-flow order, exactly the successive retirement schedule given in Example 1.

We can also obtain good ILP schedules for super blocks from the sequential schedule. Specifically, we can show that list scheduling, using the optimal sequential schedule as the list, gives good approximate solutions for S-graphs. We defer the proof to the full paper.

Theorem: For S-graphs in which all operations have equal execution time, the list scheduling algorithm, using the optimal sequential schedule as the list, is an approximation algorithm with a performance ratio of 2.

The Practical Heuristic

Notice that our theorem for profile-driven ILP scheduling above is quite limited since it requires that all operations have equal execution time. For the practical situation, we offer a quality heuristic, based on our theoretical analysis from the earlier sections.

The Modified Rank Function

In our basic lemma, we computed the rank of a set of vertices to be the sum of the latencies in the set divided by the sum of the exit probabilities. For the single unit case, the numerator is a good measure of the length of the time required to compute the set of vertices. Extending our notion of rank from the sequential setting to the ILP setting, we will replace the numerator by the length of the schedule to compute the set of vertices. We call this the modified rank or the mrank of a set of vertices, where

$$\mathrm{mrank}(A) = \frac{\text{length of schedule for } A}{\text{sum of exit probabilities in } A}.$$

As in the basic lemma, $G^*$ is defined to be the smallest precedence closed subgraph of $G$ of minimum modified rank.
The intuition behind the modified rank is that the numerator is the time required to retire A, while the denominator is the benefit in retiring A. Thus, the ratio reflects the amount of computational time required per unit of exit probability.
Minimizing this ratio in selecting $G^*$ has the effect of maximizing the "return on investment" in the schedule. Given an S-graph, $G^*$ can be found by the following simple procedure.
Algorithm: Finding $G^*$ under modified rank

For each branch $b$ of the S-graph:
    On the given processor, construct a list schedule for the subgraph $G_b$ rooted at $b$, ignoring profile information.
    Let $T$ be the length of the schedule and let $W$ be the sum of the exit probabilities of all exits in $G_b$.
    rank($G_b$) = $T/W$
$G^*$ is $G_b$ for the earliest $b$ in control order that has the minimum modified rank.
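The selection step of this procedure can be sketched as below, assuming the schedule length T and the exit-probability sum W have already been obtained for each candidate subgraph. The function name `pick_g_star` and the numbers in the example are made up for illustration; they are not taken from Figure 1.

```python
def pick_g_star(candidates):
    """candidates: list of (branch, T, W) tuples in control order.

    T is the list-schedule length of the subgraph rooted at the branch and W
    is the sum of the exit probabilities inside it. Returns (branch, mrank).
    """
    best = None
    for branch, length, exit_prob in candidates:
        mrank = length / exit_prob
        if best is None or mrank < best[1]:  # strict '<' keeps the earliest
            best = (branch, mrank)
    return best


# Made-up values: exits with probabilities 0.25, 0.25 and 0.5, so the
# cumulative sums W are 0.25, 0.5 and 1.0.
candidates = [("b1", 2, 0.25), ("b2", 3, 0.5), ("b3", 6, 1.0)]
choice = pick_g_star(candidates)
```

Here `b2` and `b3` tie at a modified rank of 6, and the strict comparison resolves the tie in favor of the earlier branch, as the procedure requires.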
The Heuristic
Now that we know how to compute $G^*$ under modified rank, we can proceed to the scheduling heuristic, given below. In words, the heuristic converts any list scheduler for precedence graphs to one that is sensitive to profile information for S-graphs. In this sense, the heuristic takes a profile-insensitive list scheduling algorithm and bootstraps it to be profile-sensitive. To start, the heuristic finds $G^*$ under modified rank, using the insensitive list scheduler. It then makes the list for $G^*$ the initial portion of the list for $G$. The heuristic deletes $G^*$ from $G$, and iterates, appending the lists each time until all of $G$ is consumed.

Consider the precedence graph of Figure 1, and apply our scheduling heuristic to it. We will use critical path scheduling as the insensitive list scheduler that is oblivious to profile information. Once again, we assume a processor with two identical functional units and equal latencies for all vertices. We have 3 candidates for $G^*$ initially, consisting of the subgraph rooted at vertex 3, denoted by $G_3$, the subgraph rooted at vertex 7, denoted by $G_7$, and the entire graph $G$. Computing their modified ranks, we get rank($G_3$) =
Algorithm
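The iteration described in words above can be sketched as follows. This is an illustrative reading of the prose, not the paper's own listing: it assumes unit latencies, `m` identical functional units, a simple critical-path list scheduler as the profile-insensitive component, and a graph whose last branch is the single sink. All function names and the small test graph are hypothetical.

```python
def subgraph_rooted_at(preds, root, alive):
    """Vertices in `alive` that `root` depends on, including `root` itself."""
    seen, stack = set(), [root]
    while stack:
        u = stack.pop()
        if u in alive and u not in seen:
            seen.add(u)
            stack.extend(preds[u])
    return seen


def critical_path_list(preds, verts, m):
    """Profile-insensitive scheduler: returns (schedule length, issue order)."""
    succs = {v: [s for s in verts if v in preds[s]] for v in verts}
    height = {}

    def h(v):  # longest chain from v to the end of the subgraph
        if v not in height:
            height[v] = 1 + max((h(s) for s in succs[v]), default=0)
        return height[v]

    done, order, cycles = set(), [], 0
    while len(done) < len(verts):
        # Predecessors outside `verts` were scheduled in an earlier round.
        ready = sorted((v for v in verts if v not in done
                        and all(p in done or p not in verts for p in preds[v])),
                       key=lambda v: (-h(v), v))
        order += ready[:m]
        done.update(ready[:m])
        cycles += 1
    return cycles, order


def profile_sensitive_list(preds, exits, m):
    """exits: dict mapping branch vertex to exit probability, in control order."""
    alive, todo, final = set(preds), list(exits), []
    while todo:
        best = None
        for b in todo:
            g_b = subgraph_rooted_at(preds, b, alive)
            length, order = critical_path_list(preds, g_b, m)
            mrank = length / sum(exits[x] for x in todo if x in g_b)
            if best is None or mrank < best[0]:  # earliest branch wins ties
                best = (mrank, g_b, order)
        final += best[2]      # append the list for G*
        alive -= best[1]      # delete G* from G and iterate
        todo = [b for b in todo if b in alive]
    return final


# Hypothetical S-graph: vertices 3 and 6 are the branches, 6 is the sink.
preds = {1: [], 2: [], 3: [1, 2], 4: [], 5: [4], 6: [3, 5]}
```

With a likely side exit, `profile_sensitive_list(preds, {3: 0.8, 6: 0.2}, 2)` retires the first block before touching the rest; with a dominant last exit, `{3: 0.2, 6: 0.8}`, the resulting list coincides with the plain critical path list. The returned list would then drive an ordinary list scheduler for the whole graph.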
Experimental Results
We now study the performance of the heuristic on a number of optimized super blocks generated by the Impact compiler from the SPEC benchmark programs. We restrict our attention to integer benchmarks, since broadly speaking the floating-point benchmarks yield super blocks with near-zero side-exit probabilities [16]. We used the Impact compiler to compile these benchmarks, decomposing each program into super blocks and basic blocks only. We report our results on scheduling these blocks over two different classes of machine models: processors with uniform functional units, and processors with heterogeneous functional units. All the models are non-pipelined with opcode execution times as specified in Table 6. The assumption that the machines are not pipelined is only in the interest of simplicity, and is not an inherent limitation of our technique. The uniform processor models have 2, 4, and 8 identical functional units respectively, and are denoted u2, u4 and u8. While uniform machine models are unrealistic in practice, they serve well to study the effect of scaling the number of functional units in a processor. The heterogeneous models h3, h5 and h8 are as shown in Table 5. Model h3 has one IALU, one FALU and one loadstore unit, model h5 has two IALUs, one FALU and two loadstore units, while model h8 has three IALUs, two FALUs and three MEM (loadstore) units. We assume that BRANCH operations can be performed on the FALU.

Table 6. Opcodes and execution times.

First, we use the critical path scheduler as the profile-insensitive scheduling algorithm to drive our heuristic, and compare its performance against three algorithms: (1) critical path scheduling from the last exit; (2) speculative yield as in Example 1 and [3]; (3) successive retirement as in Example 1. Table 7 shows the improvements achieved by our heuristic over critical-path scheduling for the benchmarks studied over the various machine models. For each benchmark and machine model, we show the improvement in the total schedule length of the benchmark. Formally, we define the total schedule length of a benchmark to be the weighted sum of the schedule lengths for all the basic blocks and super blocks for that benchmark, where the weights are the execution frequencies obtained via profiling. In our experience, this is a good measure of the run-time of the benchmark on a typical machine with a sufficient number of registers. Table 8 shows the improvements achieved by our heuristic over speculative yield scheduling, and Table 9 shows the improvements achieved by our heuristic over successive retirement scheduling. Referring to Table 9, notice that on narrow machines such as u2 and h3, little performance gain is evidenced. This is because successive retirement is optimal on the sequential processor as shown in our theoretical analysis, and is likely a good schedule on narrow machines.
To substantiate our claim that our heuristic is a general paradigm for converting a profile-insensitive scheduler to a profile-sensitive one, we apply our heuristic using successive retirement as the profile-insensitive scheduler, over the same set of benchmarks and machine models. The results are shown in Table 10.
Discussion
In our performance studies above, the total schedule length of a benchmark depends on the nature and mix of basic blocks and the super blocks produced during compilation. Our heuristic is designed to improve the performance of super blocks that have side exits with substantial exit frequency. If the compiler is not aggressive in creating such super blocks, or if side exits occur very infrequently, the opportunities for performance gains are limited. To examine this in detail, we introduce the notion of the critical path ratio of a super block, which aims to measure the relative importance of the side exits of a super block. To this end, we define the expected critical path length as the weighted sum of the lengths of the critical paths of all the exits, weighted by their exit probabilities. The critical path ratio is the ratio of the expected critical path length to the length of the critical path of the last exit. If the critical path ratio is small compared to unity, then the side exits are significant, and if the critical path ratio is close to unity, then the last exit is predominant. It is clear that every basic block will have a critical path ratio of unity. Figure 2 shows the average improvement achieved by our heuristic over critical path scheduling, as a function of the critical path ratio. The plots represent averages over all the basic blocks and super blocks obtained from compiling the benchmarks studied. The plots marked u2, u4 and u8 in the figure refer to the respective uniform processor models. As an example of how to read these plots, observe that blocks with critical path ratio of 0.2 enjoy a 30% improvement on average when scheduled by our heuristic, as compared to scheduling by critical path from the last exit, with respect to the two-functional-unit machine u2.
As the critical path ratio nears unity, the achieved improvement falls off, as is to be expected since in this case the last exit is predominant, and our heuristic converges to critical path scheduling. Notice also that as the number of available functional units increases from u2 to u4 and u8, the achieved improvement falls off. This is because critical path scheduling is increasingly good for wider processors (optimal in the limiting case of infinitely wide processors), and there is reduced opportunity for performance gains by rearranging the schedule. Also shown in the figure is the distribution of the blocks, depicted as cumulative percentage against critical path ratio. As an example of how to read this plot, observe that roughly 30% of the blocks in the sample have a critical path ratio of 0.8 or less. At this value of critical path ratio, the performance improvement is down to a few percent. Hence, the remaining 70% of the blocks are not good candidates for improvement via our scheduling heuristic. This suggests that if super block formation heuristics could form super blocks with lower critical path ratios, our scheduling algorithm would have increased opportunity for performance gains.
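The critical path ratio defined above reduces to a short computation, sketched here. The function name and the list-of-pairs encoding are assumptions for illustration, and the critical path lengths in the example are invented rather than read from any figure.

```python
def critical_path_ratio(exits):
    """exits: list of (exit_probability, critical_path_length), last exit last."""
    expected = sum(p * length for p, length in exits)  # expected critical path
    return expected / exits[-1][1]


# Invented super block with exit probabilities 0.1, 0.3 and 0.6.
ratio = critical_path_ratio([(0.1, 2), (0.3, 4), (0.6, 10)])  # 7.4 / 10
```

A basic block has a single exit taken with probability one, so `critical_path_ratio([(1.0, 5)])` is unity, in agreement with the remark in the text.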
Another factor that affects the performance gains realized by our scheduling heuristic is the amount of parallelism present in a super block. If a super block has little parallelism, then the critical path schedule will not saturate the processor, and little performance gain can be obtained since the schedule is not constrained by resources. On the other hand, if the super block has a lot of parallelism in it, the critical path schedule will saturate the processor, and much performance gain can be had by rearranging the schedule in favor of high probability exits. A good measure of the parallelism available in a block is the processor utilization factor of the schedule for the block. This is essentially the average load on the processor during the schedule expressed as a percentage. Formally, the processor utilization factor is the number of cycles for which each functional unit is busy, summed over all functional units, and expressed as a percentage of the product of the length of the schedule and the number of functional units.
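The formal definition of the processor utilization factor translates directly into code, as in this minimal sketch. The function name and the per-unit list of busy cycles are assumptions made for the example.

```python
def utilization_factor(busy_cycles, schedule_length):
    """busy_cycles: one entry per functional unit, counting its busy cycles."""
    slots = schedule_length * len(busy_cycles)  # total issue slots available
    return 100.0 * sum(busy_cycles) / slots


# Two functional units over a 4-cycle schedule, busy for 4 and 2 cycles.
util = utilization_factor([4, 2], 4)  # 6 busy slots out of 8
```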
Thus we have two independent factors that can affect the gains realized by our scheduling heuristic: (1) the importance of the side exits as reflected in the critical path ratio, and (2) the amount of parallelism available as reflected in the processor utilization. We will now examine the results of Table 7 in light of these two factors. To do so, let us extend the notion of the critical path ratio to benchmarks: the critical path ratio of a benchmark is the weighted sum of the critical path ratios of the blocks composing it, where the weights are the execution probabilities of each block. Similarly, the utilization factor of a benchmark is the weighted sum of the utilization factors of the blocks composing it, where the weights are the execution probabilities of each block. Our heuristic should perform well when the critical path ratio is small and the utilization factor is large. We test this hypothesis in Figure 3. The horizontal axis in the plot is the critical path ratio and the vertical axis is the processor utilization for the critical path schedule on processor model u4. Each box in the figure represents a benchmark, with the center of the box corresponding to its critical path ratio and processor utilization on the horizontal and vertical axes respectively.
The length of the side of each box is directly proportional to the improvement achieved by our heuristic on the benchmark, corresponding to the entry in column u4 of Table 7. The benchmarks that have a high critical path ratio enjoy very little performance gain independent of their processor utilization. These benchmarks are shown as small boxes. Also, benchmarks that have little parallelism, manifested as low processor utilization, enjoy very little performance gain even if they have low critical path ratio. Thus, the performance gains of Table 7 are well explained by our intuition, which lends support to the conclusion that the heuristic exhibits gains where gains are possible.
Conclusion
We presented a theoretical analysis of the general problem of scheduling with profile information: we gave a general lemma characterizing optimal sequential schedules for a weighted precedence graph. In a heuristic extension to this lemma, we presented a generic scheme for converting profile-insensitive list scheduling algorithms into profile-sensitive scheduling algorithms for super blocks. Experiments show that in some settings, our heuristic can offer substantial performance improvement over prior methods on a range of benchmarks. The sensitivity of the heuristic to the profile data remains a topic for study.
