Abstract. Instruction scheduling is a necessary step in compiling for many modern microprocessors. Traditionally, global instruction scheduling techniques have outperformed local techniques. However, many of the global scheduling techniques described in the literature have the side effect of increasing the size of compiled code. In an embedded system, the size of compiled code is often a critical issue. In such circumstances, the scheduler should use techniques that avoid increasing the size of the generated code. This paper explores two global scheduling techniques, extended basic block scheduling and dominator path scheduling, that do not increase the size of the object code, and in some cases may decrease it.
Introduction
The embedded systems environment presents unusual design challenges. These systems are constrained by size, power, and economics; these constraints introduce compilation issues not often considered for commodity microprocessors. One such problem is the size of compiled code. Many embedded systems have tight limits on the size of both ram and rom. To be successful, a compiler must generate code that runs well while operating within those limits.
The problem of code space reduction was studied in the 1970's and the early 1980's. In the last ten years, the issue has largely been ignored. During those ten years, the state of both processor architecture and compiler-based analysis and optimization has changed. To attack the size of compiled code for embedded systems, we must go back and re-examine current compiler-based techniques in light of their impact on code growth. This paper examines the problem of scheduling instructions in a limited-memory environment. Instruction scheduling is one of the last phases performed by modern compilers. It is a code reordering transformation that attempts to hide the latencies inherent in modern-day microprocessors. On processors that support instruction-level parallelism, it may be possible to hide the latency of some high-latency operations by moving other operations into the "gaps" in the schedule.
Scheduling is an important problem for embedded systems, particularly those built around dsp-style processors. These microprocessors rely on compiler-based instruction scheduling to hide operation latencies and achieve reasonable performance. Unfortunately, many scheduling algorithms deliberately trade increased code size for improvements in running time. This paper looks at two techniques that avoid increasing code size and presents experimental data about their effectiveness relative to the classic technique, local list scheduling.
For some architectures, instruction scheduling is a necessary part of the process of ensuring correct execution. These machines rely on the compiler to insert nops to ensure that individual operations do not execute before their operands are ready. Most vliw architectures have this property. On these machines, an improved schedule requires fewer nops; this can lead to a direct reduction in code space. If, on the other hand, the processor uses hardware interlocks to ensure that operands are available before their use, instruction scheduling becomes an optimization rather than a necessity. On these machines, nop insertion is not an issue, so the scheduler is unlikely to make a significant reduction in code size.
In this paper, we focus on the vliw-like machines without hardware interlocks. (Of course, good scheduling without code growth may be of interest on any machine.) For our discussion, we need to differentiate between operations and instructions. An operation is a single, indivisible command given to the hardware (e.g., an add or load operation). An instruction is a set of operations that begin execution at the same time on different functional units.
Traditionally, compilers have scheduled each basic block in the program independently. The first step is to create a data precedence graph, or dpg, for the block. Nodes in this graph are operations in the block. An edge from node a to node b means that operation b must complete its execution before operation a can begin. That is, operation a is data dependent on operation b. Once this graph is created, it is scheduled using a list scheduler [17, 11].
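The local scheme just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the scheduler used in the experiments: the function name `list_schedule`, the `width` parameter (operations issued per cycle, with no distinction between functional units), and the choice of the latency-weighted longest path to the end of the dpg as the priority are all assumptions made for the example.

```python
def list_schedule(ops, deps, latency, width=2):
    """Greedy forward list scheduling of one basic block (illustrative sketch).

    ops     : operation names in original program order
    deps    : deps[a] is the set of operations that must complete before a
              can begin (the edges a -> b of the dpg)
    latency : latency[o] is the number of cycles before o's result is ready
    width   : hypothetical issue width (operations per instruction)
    Returns one list of operations per cycle; an empty list is a nop cycle.
    Assumes the dependence graph is acyclic.
    """
    # users[b] lists the operations that depend on b.
    users = {o: [] for o in ops}
    for a in ops:
        for b in deps.get(a, ()):
            users[b].append(a)

    # Priority: latency-weighted length of the longest path to the end of the dpg.
    prio = {}

    def weight(o):
        if o not in prio:
            prio[o] = latency[o] + max((weight(u) for u in users[o]), default=0)
        return prio[o]

    for o in ops:
        weight(o)

    start, schedule, remaining, cycle = {}, [], list(ops), 0
    while remaining:
        # An operation is ready once every operand producer has finished.
        ready = [o for o in remaining
                 if all(b in start and start[b] + latency[b] <= cycle
                        for b in deps.get(o, ()))]
        ready.sort(key=lambda o: -prio[o])  # stable: ties keep program order
        issued = ready[:width]
        for o in issued:
            start[o] = cycle
            remaining.remove(o)
        schedule.append(issued)
        cycle += 1
    return schedule


ops = ["load1", "load2", "add", "store"]
deps = {"add": {"load1", "load2"}, "store": {"add"}}
latency = {"load1": 3, "load2": 3, "add": 1, "store": 1}
for cycle, instr in enumerate(list_schedule(ops, deps, latency, width=2)):
    print(cycle, instr or ["nop"])
```

With an issue width of two, both loads start in the first cycle and two idle cycles remain before the add; with a width of one, the schedule is one cycle longer. The idle cycles are the nops that a better (larger-scope) schedule tries to fill.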
Since basic blocks are usually rather short, the typical block contains a limited amount of instruction-level parallelism. To improve this situation, regional and global instruction scheduling methods have been developed. By looking at larger scopes, these methods often find more instruction-level parallelism to exploit. This paper examines two such techniques, extended basic block scheduling (ebbs) and dominator path scheduling (dps). Both methods produce better results than scheduling a single basic block; this results in fewer wasted cycles and fewer inserted nops.
We selected these two techniques because neither increases code size. In the embedded systems environment, the compiler does not have the luxury of replicating code to improve running time. Instead, the compiler writer should pay close attention to the impact of each technique on code size. These scheduling techniques attempt to improve over local list scheduling by examining larger regions in the program; at the same time, they constrain the movement of instructions in a way that avoids replication. Thus, they represent a compromise between the desire for runtime speed and the real constraints of limited memory machines.
Section 2 provides a brief overview of prior work on global scheduling. In section 3 we explain in detail the two techniques used in our experiments: namely extended basic block scheduling (ebbs) and dominator-path scheduling (dps). Section 4 describes our experiments and presents our experimental results.
Global Scheduling Techniques
Because basic blocks typically have a limited amount of parallelism [20], global scheduling methods have been developed in the hopes of improving program performance. All the global techniques we will be describing alter the scope of scheduling, and not the underlying scheduling algorithm. Each technique constructs some sequence of basic blocks and schedules the sequence as if it were a single basic block. Restrictions on moving operations between basic blocks are typically encoded in the dpg for the sequence.
The first automated global scheduling technique was trace scheduling, originally described by Fisher [8]. The technique has been used successfully in several research and industrial compilers [7, 18]. In trace scheduling, the most frequently executed acyclic path through the function is determined using profile information. This "trace" is treated like a large basic block. A dpg is created for the trace, and the trace is scheduled using a list scheduler. Restrictions on interblock code motion are encoded in the dpg. After the first trace is scheduled, the next most frequently executed trace is scheduled, and so on. A certain amount of "bookkeeping" must be done when scheduling a trace. Any operation that moves above a join point in the trace must be copied into all other traces that enter the current trace at that point. Likewise, any operation that moves below a branch point must be copied into the other traces that exit the branch point, if the operation computes any values that are live in that trace.
One criticism of trace scheduling is its potential for code explosion due to the bookkeeping code. Freudenberger et al. argue that this does not arise in practice [10]. They show an average code growth of six percent for the SPEC89 benchmark suite and detail ways to avoid bookkeeping (or compensation) code altogether. Restricting the trace scheduler to produce no compensation code only marginally degrades the performance of the scheduled code.
Hwu et al. present another global scheduling technique called superblock scheduling [14]. It begins by constructing traces. All side entrances into the traces are removed by replicating blocks between the first side entrance and the end of the trace. This tail duplication process is repeated until all traces have a unique entry point. This method can lead to better runtime performance than trace scheduling, but the block duplication can increase code size. Several other techniques that benefit from code replication or growth have been used. These include Bernstein and Rodeh's "Global Instruction Scheduling" [4, 2], Ebcioglu and Nakatani's "Enhanced Percolation Scheduling" [6], and Gupta and Soffa's "Region Scheduling" [12].
In this section we look at two non-local scheduling techniques specifically designed to avoid increasing code size, namely dominator-path scheduling (dps), and extended basic block scheduling (ebbs). We assume that, prior to scheduling, the program has been translated into an intermediate form consisting of basic blocks of operations. Control flow is indicated by edges between the basic blocks. We assume this control flow graph (cfg) has a unique entry block and a unique exit block.
Extended basic block scheduling
Little work has been published on scheduling over extended basic blocks. Freudenberger et al. show some results of scheduling over extended basic blocks, but only after doing some amount of loop unrolling [10]. Since we are striving for zero code growth, such loop unrolling is out of the question.
An extended basic block (or ebb) is a sequence of basic blocks, $B_1, \ldots, B_k$, such that, for $1 \le i < k$, $B_i$ is the only predecessor of $B_{i+1}$ in the cfg, and $B_1$ may or may not have a unique predecessor [1]. For scheduling purposes, we view extended basic blocks as a partitioning of the cfg; a basic block is a member of only one ebb.
The first step in ebbs is to partition the cfg into extended basic blocks. We define the set of header blocks to be all those blocks that are the first block in some ebb. Initially, our set of headers consists of the start block and all blocks with more than one predecessor in the cfg. Once this initial set of headers is computed, we compute a weighted size for each basic block. The size for a header is set to zero. The size for all other blocks equals the total number of operations in the block weighted by their latencies plus the maximum size of all the block's successors in the cfg. To construct the ebbs, we maintain a worklist of header blocks. When a block B is pulled off the worklist, other blocks are added to its ebb based on the sizes computed earlier. The successor of B in the cfg with the largest size is added to B's ebb. The other successors of B are added to the worklist to become headers for some other ebb. This process continues for the new block, until no more eligible blocks are found for the current ebb. For each ebb, a dpg is constructed, and the ebb is scheduled with a list scheduler.
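The partitioning step can be illustrated with a short Python sketch. It follows the description above under a few stated assumptions: the cfg is given as a successor map `succ` in which every block is reachable from `start`, `weight[b]` is the latency-weighted operation count of block `b`, and an "eligible" successor is one that is neither a header nor already placed. The function name `partition_ebbs` and the data layout are hypothetical.

```python
def partition_ebbs(succ, weight, start):
    """Partition a cfg into extended basic blocks (illustrative sketch).

    succ   : dict mapping each block to the list of its cfg successors
    weight : dict mapping each block to its latency-weighted operation count
    start  : the unique entry block
    Returns a list of ebbs, each a list of blocks in path order.
    """
    pred_count = {b: 0 for b in succ}
    for b in succ:
        for s in succ[b]:
            pred_count[s] += 1

    # Initial headers: the start block and every block with several predecessors.
    headers = {start} | {b for b in succ if pred_count[b] > 1}
    initial_headers = frozenset(headers)

    # Weighted size: zero for headers, otherwise the block's own weight plus
    # the largest size among its successors.
    size = {}

    def block_size(b):
        if b in initial_headers:
            return 0
        if b not in size:
            size[b] = weight[b] + max((block_size(s) for s in succ[b]), default=0)
        return size[b]

    worklist = [start] + [b for b in succ if b in headers and b != start]
    placed, ebbs = set(), []
    while worklist:
        cur = worklist.pop(0)
        ebb = [cur]
        placed.add(cur)
        while True:
            eligible = [s for s in succ[cur] if s not in headers and s not in placed]
            if not eligible:
                break
            nxt = max(eligible, key=block_size)
            for s in eligible:          # the losers head their own ebbs
                if s != nxt:
                    headers.add(s)
                    worklist.append(s)
            ebb.append(nxt)
            placed.add(nxt)
            cur = nxt
        ebbs.append(ebb)
    return ebbs


# A diamond-shaped cfg: A branches to B and C, which both rejoin at D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
weight = {"A": 2, "B": 5, "C": 3, "D": 4}
print(partition_ebbs(succ, weight, "A"))
```

In the diamond example, D has two predecessors and is therefore a header, so neither B nor C can absorb it; B is the heavier successor of A and joins A's ebb, while C becomes the header of its own ebb.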
We must prohibit some operations from moving between the blocks of an ebb. Assume a block $B_1$ has successors $B_2, B_3, \ldots, B_n$ in the cfg. Further assume that $B_2$ is placed in the same ebb as $B_1$. We prohibit moving an operation from $B_2$ to $B_1$, and vice versa, if that operation defines a value that is live along some path from $B_1$ to $B_i$ where $i \neq 2$. We call this set of values path-live with respect to $B_2$, or $PL_{B_2}$. The set is a portion of the set liveout($B_1$) as computed by the following equation:
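Assuming the usual dataflow identity that the liveout set of a block is the union of the livein sets of its cfg successors, a form consistent with this definition is:

```latex
PL_{B_2} \;=\; \bigcup_{i=3}^{n} \mathrm{livein}(B_i)
\;\subseteq\; \bigcup_{i=2}^{n} \mathrm{livein}(B_i) \;=\; \mathrm{liveout}(B_1)
```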
Intuitively, we can't move the operation if any value it defines is used in some block other than $B_1$ or $B_2$ and that block is reachable from $B_1$ via some path not containing $B_2$. The operations that can be moved are called partially dead if they are in $B_1$ [16].
Dominator-path scheduling
dps was originally described in Sweany's thesis [21]. Other work was done by Sweany and Beaty [22], and Huber [13].
We say a basic block $B_1$ dominates block $B_2$ if all paths from the start block of the cfg to $B_2$ must pass through $B_1$ [19]. If $B_1$ dominates $B_2$, and block $B_2$ executes on a given program run, then $B_1$ must also execute. We define the immediate dominator of a block B (or idom(B)) to be the dominator closest to B in the cfg. Each block must have a unique immediate dominator, except the start block, which has no dominator. Let G = (N, E) be a directed graph, where the set N is the set of basic blocks in the program, and define $E = \{(u, v) \mid u = \mathrm{idom}(v)\}$.
Since each block has a unique immediate dominator, this graph is a tree, called the dominator-tree. A dominator-path is any path between two nodes of the dominator-tree.
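As an illustration of these definitions, the following Python sketch computes dominator sets with the simple iterative dataflow formulation and then derives immediate dominators from them. It is a sketch under stated assumptions, not the algorithm used in the compiler: the cfg is a successor map in which every block is reachable from `start`, and the function names `dominators` and `immediate_dominators` are hypothetical.

```python
def dominators(succ, start):
    """Return dom[b], the set of blocks that dominate b (including b itself)."""
    nodes = list(succ)
    preds = {b: [] for b in nodes}
    for b in nodes:
        for s in succ[b]:
            preds[s].append(b)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {b: set(nodes) for b in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for b in nodes:
            if b == start:
                continue
            new = set(nodes)
            for p in preds[b]:
                new &= dom[p]       # a dominator must dominate every predecessor
            new |= {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


def immediate_dominators(dom, start):
    """Return idom[b] for every block except the start block."""
    idom = {}
    for b, ds in dom.items():
        if b == start:
            continue
        strict = ds - {b}
        # The strict dominators of b form a chain; the closest one to b is the
        # one with the largest dominator set of its own.
        idom[b] = max(strict, key=lambda d: len(dom[d]))
    return idom


succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
dom = dominators(succ, "A")
print(immediate_dominators(dom, "A"))
```

In this diamond-shaped example, D is reached through either B or C, so neither dominates it and its immediate dominator is A; the edges (idom(B), B) then form the dominator-tree with A at the root.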
We now define two sets, idef(B) and iuse(B). For a basic block B, idef(B) is the set of all values that may be defined on some path from idom(B) to B (not including B or idom(B)). Likewise, iuse(B) is the set of all values that may be used on some path from idom(B) to B. The algorithm for efficiently computing these sets is given by Reif and Tarjan [23].
dps schedules a dominator-path as if it were a single basic block. First, the blocks in the cfg must be partitioned into different dominator-paths. Huber describes several heuristics for doing path selection and reports on their relative success. We use a size heuristic similar to the one described above for ebbs. This is done via a bottom-up walk over the dominator-tree. The size of a leaf equals the latency-weighted number of operations in the block. For all other blocks, size equals the latency-weighted number of operations in the block plus the maximum size of all the block's children in the dominator-tree. When building the dominator-paths, we select the next block in the path by choosing the child in the dominator-tree with the largest size. All other children become the first block in some other dominator-path. Once the dominator-paths are selected, a dpg is created for each path, and the path is scheduled using a list scheduler. After each path is scheduled, liveness analysis and the idef and iuse sets must be recomputed to ensure correctness.
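The path-selection heuristic can be sketched as follows. This is a minimal illustration that assumes the dominator-tree is supplied as an `idom` map and that `weight[b]` is the latency-weighted operation count of block `b`; the function name `dominator_paths` is hypothetical, and the sketch covers only path selection, not dpg construction or scheduling.

```python
def dominator_paths(idom, weight, root):
    """Partition the dominator-tree into paths using the size heuristic.

    idom   : dict mapping each non-root block to its immediate dominator
    weight : dict mapping every block to its latency-weighted operation count
    root   : the start block (root of the dominator-tree)
    Returns a list of dominator-paths, each a list of blocks from top to bottom.
    """
    children = {b: [] for b in weight}
    for b, d in idom.items():
        children[d].append(b)

    # Bottom-up sizes: a block's own weight plus the largest child size.
    size = {}

    def subtree_size(b):
        size[b] = weight[b] + max((subtree_size(c) for c in children[b]), default=0)
        return size[b]

    subtree_size(root)

    paths, worklist = [], [root]
    while worklist:
        b = worklist.pop(0)
        path = [b]
        while children[b]:
            nxt = max(children[b], key=size.get)   # extend along the largest child
            worklist.extend(c for c in children[b] if c != nxt)
            path.append(nxt)
            b = nxt
        paths.append(path)
    return paths


idom = {"B": "A", "C": "A", "D": "A", "E": "D"}
weight = {"A": 2, "B": 5, "C": 3, "D": 4, "E": 6}
print(dominator_paths(idom, weight, "A"))
```

Here D has size 10 (its own weight plus that of E), so the path starting at A continues through D and E even though B is the heaviest single block; B and C each start a path of their own.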
When the compiler builds the dpg for the dominator-path, it adds edges to prevent motion of certain operations between basic blocks. Assume $B_1$ is the immediate dominator of $B_2$. Sweany's original formulation prohibited moving an operation from $B_2$ up into $B_1$ if that operation defined a value in idef($B_2$) ∪ iuse($B_2$), or if it referenced a value in idef($B_2$). That is, we don't want to move an operation that defines a value V above a use or definition of V in the cfg. Likewise, an operation that references V is not allowed to move above a definition of V. This strategy is safe when $B_1$ dominates $B_2$ and $B_2$ postdominates $B_1$.
However, Huber showed that in the general case this strategy is unsafe. Figure 1(a) 
cannot be moved up from $B_2$ into $B_1$. However, we have found that this formulation too is unsafe. Figure 1(b) demonstrates the problem. Again, assume that blocks $B_1$ and $B_2$ will be scheduled together. Note that r1 ∈ liveout($B_1$) and r1 ∈ livein($B_2$) since it is referenced before it is defined in $B_2$. Therefore, r1 is not in the set liveout($B_1$) − livein($B_2$). Assuming operation 2 does not define anything that causes movement to be unsafe, we can move it up into block $B_1$. It would then be legal to move operation 3 into $B_1$. Thus both operations in $B_2$ could be moved into block $B_1$, which would cause operation 4 in block $B_4$ to potentially get the wrong value. Once a dominator-path is selected for scheduling, no updating of the liveness information is done during the scheduling of that dominator-path. Some sort of incremental update would be one way to solve this problem, since moving operation 2 into $B_1$ would cause r1's removal from the set livein($B_2$).
We use an approach that doesn't require incremental updates. What we really want to capture are those values that are live along paths other than paths from $B_1$ to $B_2$. This is fairly straightforward if $B_1$ is the only parent of $B_2$ in the cfg; we simply use the path-live notion discussed in the previous section. In other cases, we take the conservative approach and don't allow any operation that defines a value in liveout($B_1$) to move up.
Now we consider motion of an operation in the downward direction. Sweany does not allow an operation to move down the cfg, that is, into block $B_2$ from its dominator $B_1$, but he does mention that this could be done if
Loops pose additional concerns. We must be careful not to allow any code that defines memory to move outside of its current loop or to a different loop nesting depth.
In addition to the restrictions described above, we disallow any operation that defines memory from moving between two blocks if they are in different loops or at different loop nesting levels. Finally, we don't allow an operation that defines registers in liveout($B_1$) to move between the two blocks.
To summarize, we disallow motion of an operation between block $B_2$ and its immediate dominator $B_1$ (forward or backward) if that operation defines a value in the set dontdef. This set is defined in figure 2. Additionally, any operations that use a value in idef($B_2$) are not allowed to move.
Experimental Results
Our research compiler takes C or Fortran code and translates it into our assembly-like intermediate form, iloc [5]. The iloc code can then be passed to various optimization passes. All the code for these experiments has been heavily optimized before being passed to the instruction scheduler. These optimizations include pointer analysis for the C codes, constant propagation, global value numbering, dead code elimination, operator strength reduction, lazy code motion, and register coalescing. No register allocation was performed before or after scheduling, as we wanted to completely isolate the effects of the scheduler. After optimization, the iloc is translated into C, instrumented to report operation and instruction counts, and compiled. This code is then run.
A variety of C and Fortran benchmark codes were studied, including several from various versions of the SPEC benchmarks and the fmm test suite [9]. The C codes used are clean, compress, dfa, dhrystone, fft, go, jpeg, nsieve, and water. All other benchmarks are Fortran codes. clean is an optimization pass from our compiler. dfa is a small program that implements the Knuth-Morris-Pratt string matching algorithm. nsieve computes prime numbers using the Sieve of Eratosthenes. water is from the SPLASH benchmark suite, and fft is a program that performs fast Fourier transforms.
A Generic VLIW Architecture
In the first set of experiments, we assume a vliw-like architecture. This hypothetical architecture has two integer units, a floating point unit, a memory unit, and a branch unit. Up to four operations can be started in parallel. Each iloc operation has a latency assigned to it. We assume that the latency of every operation is known at compile time. The architecture is completely pipelined, and nops must be inserted to ensure program correctness.
We compare dps and ebbs to scheduling over basic blocks. In each case the underlying scheduler is a list scheduler that assigns priorities to each operation based on the latency-weighted depth of the operation in the dpg. For both dps and ebbs we select which blocks to schedule based on the size heuristic described above. In this experiment, we permit all blocks in a given ebb or dominator-path to be at any loop nesting level. Code is allowed to move between blocks as described above. One additional restriction on code movement is that we do not allow any operations that could cause an exception to be moved "up" in the cfg. We do not allow any divide operations, or loads from pointer memory (iloc's PLDor operations), to move up.
Table 1 shows the dynamic instruction counts for our benchmark codes. This value can be thought of as the number of cycles required to execute the code. Both ebbs and dps resulted in faster code than basic block scheduling. dps outperformed ebbs slightly more than fifty per cent of the time, and a few of these wins were substantial. On average ebbs produced a 6.5 per cent reduction in the number of dynamic instructions executed, and dps produced a 7.5 per cent reduction.
Table 2 shows the static instruction counts for the same experiments. This corresponds to the "size" (number of instructions) of the object code. Note that all the object codes have the same number of operations; only the number of instructions changes. dps did better by this metric in roughly the same number of experiments. However, the static and dynamic improvements did not necessarily occur on the same codes. This demonstrates that smaller, more compact code does not always result in enhanced runtime performance. On average ebbs reduced static code size by 10.9 per cent and dps by 11.8 per cent. When performing basic block scheduling, we found each block had an average of 6.8 operations (over all benchmarks). On average, an ebb consisted of 1.8 basic blocks and 12.4 operations. Dominator paths averaged 2.2 basic blocks and 15.1 operations each.
We also measured the amount of time required to schedule. The scheduling times for each benchmark are shown in table 3. In two runs, the average scheduling time for all benchmarks was 88 seconds for basic block scheduling, 92 seconds for ebbs, and 2297 seconds for dps. This comparison is a bit unfair. Several of our C codes have many functions in each iloc module. Thus dps is performing the dominator analysis for the whole file every time a dominator-path is scheduled. The go benchmark contributed 2109 seconds alone. We totaled times for the Fortran benchmarks (all iloc files contain a single function), and a random sampling of the single function C codes (about 24 functions). The scheduling times were 56 seconds for basic block scheduling, 50 seconds for ebbs, and 105 seconds for dps. If we eliminate fpppp, which actually scheduled faster with ebbs than basic block scheduling, we get times of 8 seconds, 10 seconds, and 49 seconds, respectively. 
The TI TMS320C62xx Architecture
The Texas Instruments TMS320C62xx chip (which we will refer to as tms320) is one of the newest fixed point dsp processors [24]. From a scheduling perspective it has several interesting properties. The tms320 is a vliw that allows up to eight operations to be initiated in parallel. All eight functional units are pipelined, and most operations have no delay slots. The exceptions are multiplies (two cycles), branches (six cycles), and loads from memory (five cycles). nops are inserted into the schedule for cycles where no operations are scheduled to begin. The nop operation takes one argument specifying the number of idle cycles. This architecture has a unique way of "packing" operations into an instruction. Operations are always fetched eight at a time. This is called a fetch packet. Bit zero of each operation, called the p-bit, specifies the execution grouping of each operation. If the p-bit of an operation o is 1, then operation o + 1 is executed in parallel with operation o (i.e., they are started in the same cycle). If the p-bit is 0, then operation o + 1 begins the cycle after operation o. The operations that execute in parallel are called an execute packet. All operations in an execute packet must run on different functional units, and up to eight operations are allowed in a single execute packet. Each fetch packet starts a new execute packet, and execute packets cannot cross fetch packet boundaries. This scheme, together with the multiple-cycle nop operation described above, allows the code for this vliw to be very compact.
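The p-bit grouping rule can be made concrete with a small Python sketch. It models only the rule as stated above, not the real instruction encoding: a fetch packet is represented as a list of hypothetical `(name, p_bit)` pairs, and the function name `execute_packets` is an assumption of this example.

```python
def execute_packets(fetch_packet):
    """Split one fetch packet into execute packets using the p-bits.

    fetch_packet : list of (operation name, p_bit) pairs, in fetch order
    A p-bit of 1 means the next operation starts in the same cycle;
    a p-bit of 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:
        # Execute packets cannot cross a fetch packet boundary, so the last
        # group is closed here regardless of its final p-bit.
        packets.append(current)
    return packets


fetch = [("A", 1), ("B", 1), ("C", 0), ("D", 0),
         ("E", 1), ("F", 1), ("G", 1), ("H", 0)]
print(execute_packets(fetch))
```

In this example the eight fetched operations issue over three cycles: A, B, and C together, then D alone, then E through H together.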
We have modified our scheduler to target an architecture that has the salient features of the tms320. Of course, there is not a one-to-one mapping of iloc operations to tms320 operations, but we feel our model highlights most of the interesting features of this architecture from a scheduling perspective. Our model has eight fully pipelined functional units. The integer operations have latencies corresponding to the latencies of the tms320. Since iloc has floating point operations and the tms320 does not, these operations are added to our model. Each floating point operation is executed on a functional unit that executes the corresponding integer operation. Latencies for floating point operations are double those for integer operations. All iloc intrinsics (cosine, power, square root, etc.) have a latency of 20 cycles. The experiments in the last section assumed perfect branch prediction. However, the tms320 has no mechanism for predicting branches. Thus, every control-flow operation (including an unconditional jump) incurs a five cycle delay to refill the pipeline. We simulate this by adding five cycles to the dynamic instruction count each time a branch, jump, subroutine call, or subroutine return is executed.
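The branch-penalty adjustment amounts to simple arithmetic, sketched below in Python. The representation of an execution trace as a list of instructions, each a list of operation kinds, and the kind names themselves are assumptions made for the example; only the five-cycle penalty and the four kinds of control-flow operation come from the description above.

```python
BRANCH_PENALTY = 5  # cycles added per executed control-flow operation
CONTROL_FLOW = {"branch", "jump", "call", "return"}  # hypothetical kind names


def dynamic_count(trace):
    """Cycle count for an executed trace, including pipeline refill delays.

    trace : list of executed instructions, each a list of operation kinds
    """
    cycles = 0
    for instruction in trace:
        cycles += 1  # one cycle per executed instruction
        cycles += BRANCH_PENALTY * sum(1 for op in instruction if op in CONTROL_FLOW)
    return cycles


trace = [["add", "load"], ["branch"], ["nop"], ["return"]]
print(dynamic_count(trace))  # 4 instructions + 2 * 5 penalty cycles = 14
```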
Our static instruction counts reflect the tms320 fetch packet/execute packet scheme. We place as many execute packets as possible in each fetch packet. nops in consecutive cycles are treated as one operation, to be consistent with the multiple-cycle nop on the tms320. Each basic block begins a new fetch packet. Table 4 shows the dynamic instruction counts for our tms320-like architecture. Static instruction counts (i.e., fetch packet counts) are reported in table 5. In dynamic instruction counts, we see improvements over basic block scheduling similar to those seen for the other architecture. On average, ebbs showed a
