Many techniques have been proposed for exploiting instruction-level parallelism, ranging from those that are optimal but expensive and ignore resource constraints, to those that introduce resource constraints in various forms. One of the most aggressive of these techniques is resource-constrained software pipelining (RCSP). RCSP works by repeatedly scheduling successive iterations of a loop in parallel until the data and resource dependence structure of the loop causes the process to converge on a repeating scheduling pattern. This repeating pattern is then used as the new loop body. In principle, this process can be made optimal with respect to full unrolling and scheduling of the loop. Of course, this is not the same as absolute optimality; however, given the NP-hard nature of the problem and the results of Schwiegelshohn et al., this may be the strongest form possible for general loop pipelining. The main drawback of RCSP is that, in practice, its space/time overhead can be fairly expensive. In this paper, we present resource-directed loop pipelining (RDLP), a new approach that attempts to retain many of the advantages of RCSP while minimizing the expense. It does so by allowing the availability of target resources to guide, in some sense, the application of parallelism-exposing and parallelizing transformations. One of the key features of RDLP is the separation of control heuristics from transformations, which allows the loop pipelining to be as general as the underlying system of code motion transformations. Results are presented which show that even with very unsophisticated heuristics, RDLP achieves roughly the same performance as RCSP, while providing a fourfold decrease in the space/time cost; moreover, we show that RDLP exposes 'just enough' parallelism (incurring a minimum of code explosion) to maximally utilize resources.
INTRODUCTION
This paper presents a new approach for exploiting instruction-level parallelism, called resource-directed loop pipelining (RDLP). Exploiting more than trivial amounts of parallelism at the instruction level requires loop pipelining (i.e. overlapping the execution of successive iterations of a loop), and literally dozens of pipelining techniques have been proposed over the last several years. So, why do we need 'yet another one'? The answer is that resource-directed loop pipelining is not 'yet another one', since it differs from virtually all other loop pipelining techniques in one important way: RDLP does not rely on the built-in, systemic constraints (i.e. built-in heuristics) traditionally used to control the pipelining process. Instead, RDLP makes heuristics explicit, precisely defined, and tunable for different application domains and target architectures.
Heuristics are essential for dealing with NP-hard problems such as resource-constrained scheduling. However, by encoding important aspects of these heuristics directly within the pipelining algorithm itself, conventional approaches arbitrarily, and sometimes severely, limit the performance of both the compiled code and the compiler. For example, modulo scheduling [1], arguably the most widely used pipelining technique, also has some of the strictest systemic constraints: conditionals must be if-converted,¹ and each (minor) iteration from the original loop within the final pipelined loop (major iteration) must have exactly the same schedule (i.e. the same operations scheduled at the same time relative to the beginning of each minor iteration). Both of these constraints can severely limit the possible schedules produced by modulo scheduling.
Conceptually, the general loop pipelining problem consists of three main tasks: exposing code to parallelize using loop transformations (e.g. unrolling), parallelizing the code using scheduling transformations (e.g. list scheduling, percolation scheduling etc.), and finally determining when to stop pipelining, or 'converge' on a final schedule. In order to recognize/force convergence of loop pipelining, virtually all loop pipelining techniques solve a simplified abstraction of the general loop pipelining problem that places constraints on how the code is exposed, how it is parallelized, or more typically, both. For example, the problem abstraction used by modulo scheduling introduces the 'fixed iteration' requirement that constrains both how code is exposed (i.e. only expose the fixed iteration) and how it is parallelized (i.e. the minor iteration schedule must remain fixed; the only scheduling allowed is in finding the initiation interval, or starting over with a new fixed iteration).
[Figure 1: Loop Unrolling; Loop Shifting]
Resource-directed loop pipelining adopts a 'back to basics' approach to loop pipelining in which we try to solve the problem of maximizing performance directly, without relying on simplifying abstractions. For any target architecture, maximizing performance equates to maximizing the effective utilization of the highly parallel, specialized, and/or irregular resources of the specific target. RDLP attempts to accomplish this by using a completely general system of parallelism-exposing and parallelizing transformations. Unlike other techniques that use built-in constraints (i.e. built-in heuristics) to determine how much parallelism is exposed or how it is exploited, RDLP places no a priori restrictions on either, relying instead on a separate suite of heuristics to control the process that can be tuned for different target architectures, application domains and cost versus performance trade-offs.
RDLP works by repeatedly exposing and exploiting parallelism within a loop until resource dependencies or loop-carried (i.e. cyclic) data dependencies prevent further parallelization, or a specified cost versus performance goal is achieved. Parallelism is exposed using loop unrolling and loop shifting,² both of which are illustrated in Figure 1. After each unroll or shift transformation, parallelism is exploited by parallelizing (i.e. compacting) the current loop body using a provably complete system of trailblazing percolation scheduling (TiPS) transformations [2].
Both unrolling and shifting have the effect of exposing code from succeeding iterations for parallelization. The key difference between the two is that whereas loop unrolling exposes a new iteration in its entirety, thus increasing the amount of code to parallelize, loop shifting simply restructures the loop, essentially 'moving' operations across the backedge, without increasing the overall amount of code within the loop body.
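The contrast between the two transformations can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which a loop body is a flat list of operation names tagged with an iteration offset; the names `unroll` and `shift` and this list representation are illustrative assumptions, not the control flow graph representation used by an actual compiler.

```python
from typing import List, Tuple

# An operation is (name, iteration_offset): offset 0 belongs to the current
# iteration, offset 1 to the next one, and so on. This flat list is only an
# illustrative stand-in for a real control flow graph.
Op = Tuple[str, int]


def unroll(body: List[Op], original: List[Op]) -> List[Op]:
    """Append one full copy of the original iteration (code size grows).

    Assumes the body has not been shifted yet, so it holds whole iterations.
    """
    iterations_in_body = len(body) // len(original)
    return body + [(name, iterations_in_body) for name, _ in original]


def shift(body: List[Op]) -> List[Op]:
    """Move the loop head across the backedge (code size is unchanged).

    The head leaves the top of the body and re-enters at the bottom as the
    same operation belonging to the following iteration.
    """
    (name, offset), rest = body[0], body[1:]
    return rest + [(name, offset + 1)]


original = [("load", 0), ("mul", 0), ("store", 0)]
unrolled = unroll(original, original)   # six operations, two iterations
shifted = shift(original)               # still three operations
```

In this toy model, `unrolled` holds two complete iterations, whereas `shifted` has the same length as the original body but now ends with the `load` of the next iteration, which a compaction pass could then try to overlap with the `mul` and `store` of the current one.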
The goal of RDLP, when allowed by cost versus performance trade-offs, is to expose just enough parallelism to maximally utilize the available resources of the target architecture. Resources are 'maximally' utilized if there exists a kind of resource (e.g. ALU, FALU etc.) for which utilization is 100% over the entire loop,³ or loop-carried dependencies prevent any further parallelization.
² Loop shifting refers to 'unwinding' a loop so that its loop head becomes a true successor of each of its predecessors (i.e. the successor(s) of the head become the new loop head(s) and the original head becomes the last instruction in the loop), thus exposing operations from subsequent iterations to be scheduled in parallel with preceding iterations.
If the loop body does not contain enough operations to maximally exploit the available resources, then loop unrolling is used to increase the number of operations in the loop until resources can be maximally utilized by the exposed parallelism, or cost versus performance goals are achieved. Unrolling and then parallelizing or compacting the loop body has the effect of overlapping the execution of the unrolled iterations, but cannot in general guarantee maximal utilization of target resources due to sequentiality imposed at the end-points of the new 'multi-iteration' loop body: even though there may be enough operations in the loop to utilize resources maximally, the 'structure' of the loop may prevent that parallelism from actually being exploited. For instance, if the loop body consists of k ≥ 1 compacted iterations⁴ of the original loop, then there is still no overlap between iterations k and k + 1, which may yet be possible and necessary if resources are to be utilized maximally. In this case, RDLP restructures the loop, using loop shifting, to remove whatever sequentiality may exist at the end-points of the (possibly unrolled) loop body. Loop shifting (and compacting) continues until resources are maximally utilized, or cost versus performance goals are achieved.
An example
Figure 2 shows the progression of RDLP for Livermore kernel 1 in the form of a graphical representation of the code obtained from VISTA (visual interface for scheduling transformation and analysis) [3], a visualization and parallelization tool that is part of the EVE mutation scheduling compiler (see Section 4). Figure 2 consists of five snapshots, labelled (a)-(e), that, as described below, show the resource utilization of the loop at each stage of the pipelining process, by 'shading' instructions in proportion to their resource utilization. In each snapshot, the outermost rectangles represent very-long instruction word (VLIW) instructions, in this case for a VLIW architecture with four general purpose functional units (capable of executing arbitrary arithmetic and logical operations), and two conditional branch units.⁵ The execution semantics for instructions are modelled after the IBM VLIW architecture [4]. Notice that the graphical representation of each instruction is divided into two segments. For each instruction, the left-hand segment is shaded in proportion to the functional unit utilization at that node. For instance, in snapshot (a), the code has not been parallelized yet, so each VLIW instruction contains only one operation or conditional branch, and therefore, the left-hand segment of each instruction (except for the instruction containing the loop control conditional) is shaded in by 25% (i.e. only one out of a possible four operations has been scheduled in the instruction). The shading in the right-hand segment is used simply to indicate which instructions are part of the loop: if the right-hand segment is shaded for an instruction, then that instruction is part of the loop. Note that the loop head and tail instructions of the loop are designated graphically by slightly wider rectangles in each of the snapshots (the relative widths of the outer rectangles do not signify a larger number of resources for these instructions).
Before RDLP: the code for Figure 2, snapshot (a), executes one iteration every 15 time steps. After RDLP: the code for Figure 2, snapshot (e), executes two iterations every seven time steps.
To help in understanding the snapshots shown in Figure 2, the actual N-addr code corresponding to the initial and final snapshots is shown in Figure 3. The N-addr code is in SSA form [5] and its opcodes are essentially those of the MIPS architecture. Note that for brevity only the loop body is shown, without loop control or phi functions.
Figure 2, snapshot (a), shows the loop before pipelining. The loop contains 14 operations and one conditional jump (for loop control), initially scheduled one per instruction. Snapshot (b) shows the loop after the original loop body has been compacted. Notice that the loop control conditional has 'moved up' from the bottom of the loop a couple of instructions, thus causing bifurcation in the control flow (the left-hand branch is the loop exit, as indicated by the absence of shading in the right-hand segment of those instructions). The shading in the left-hand side of the loop instructions in snapshot (b) indicates that functional unit utilization is 100% in the first two instructions, and then tapers off until the end of the loop. At this point, RDLP attempts to determine whether there are enough operations currently in the loop body to maximally utilize the resources, or whether a loop-carried dependence prevents further utilization. Only trivial loop-carried dependencies on induction variables exist for this loop, so making this determination essentially involves determining whether or not the number of operations competing for the most heavily used type of resource is evenly divided by the number of available resources of that type. In this case, the most heavily utilized resource type is the generic functional unit (as opposed to the conditional branch units), so given that 14 operations/4 functional units = 3.5, no matter how well loop shifting might overlap these operations, it is not yet possible for the functional units within the loop to be fully utilized.
Assuming that the performance goal is 100% utilization and that cost constraints (e.g. code size, compile time) allow it, RDLP would then unroll the loop in order to increase the number of operations in the loop body. Snapshot (c) shows the loop after it has been unrolled once and compacted. Notice that resources are now almost completely utilized within the loop. Moreover, the loop now has 28 operations,⁶ which is evenly divided by the four available functional units, so no further unrolling is needed. At this point, RDLP uses loop shifting to fill the few remaining unutilized resources at the bottom of the loop. Snapshot (d) shows the loop after being shifted one instruction and compacted. Notice that resource utilization has increased, as indicated by the darker shading in the left-hand segment. However, resources are not yet fully utilized, so more shifting is indicated. Snapshot (e) shows that one more loop shift and compaction is enough to completely utilize the resources for this loop on this target architecture.
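The arithmetic behind this example can be written out as a small sketch. The function names are hypothetical, and `iterations_needed` rests on the simplifying assumption that every unrolled iteration contributes the same number of operations, which holds for this example but need not hold in general once other optimizations are applied during compaction.

```python
from math import gcd


def cycles_at_full_utilization(num_ops: int, num_units: int) -> float:
    """Instructions needed if every functional unit were busy in every one."""
    return num_ops / num_units


def can_fully_utilize(num_ops: int, num_units: int) -> bool:
    """True when the operations divide evenly among the functional units."""
    return num_ops % num_units == 0


def iterations_needed(ops_per_iteration: int, num_units: int) -> int:
    """Smallest number of iterations whose operations divide evenly.

    Assumption: each unrolled iteration adds exactly ops_per_iteration
    operations.
    """
    return num_units // gcd(ops_per_iteration, num_units)


# One iteration of the example loop: 14 operations on 4 functional units.
one_iteration = cycles_at_full_utilization(14, 4)    # 3.5, not a whole number
two_iterations = cycles_at_full_utilization(28, 4)   # 7.0
```

With 14 operations per iteration and four functional units, `iterations_needed(14, 4)` evaluates to 2, matching the single unrolling step in the example, and the resulting 7 instructions for two iterations is consistent with the final schedule of two iterations every seven time steps.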
Notice that RDLP places no a priori limitations on how much parallelism is exposed, or how it is exploited: the amount of unrolling/shifting and the aggressiveness of the parallelization is determined entirely by a separate set of heuristics that depend on the application domain, target architecture and cost versus performance trade-offs.
The remainder of this paper is organized as follows. Section 2 describes the related work that motivates RDLP, Section 3 describes the RDLP algorithm, Section 4 describes a particular implementation of RDLP, and finally, Section 5 presents results showing its effectiveness.
RELATED WORK
Resource-directed loop pipelining (RDLP) is motivated by three other well-known pipelining techniques: modulo scheduling [1], resource-constrained software pipelining (RCSP) [7] and enhanced pipelined percolation scheduling (EPPS) [8]. Each of these techniques has been proven to have theoretical and/or practical advantages; nevertheless, all three rely on built-in, systemic constraints that limit their efficiency and/or performance. Below we describe each of these techniques in turn, with emphasis on their strengths, which we try to retain in RDLP, and their weaknesses, which we try to avoid.
⁶ In this example, unrolling and compacting one more iteration resulted in exactly 14 more operations for a total of 28; however, in general, the incremental increase in the number of operations after compaction need not equal the number of operations in the original loop body, since one of the strengths of RDLP is that other code transformations and optimizations that can change the number of operations (such as load-after-store elimination, redundant operation removal, constant folding etc.) can be integrated as part of the parallelization process. See [6] for a discussion of how this can be done elegantly and uniformly and why it is important.
Modulo scheduling
Modulo scheduling is by far the most widely used of these techniques and has proven itself to be very useful in practice. The basic algorithm consists of scheduling one operation at a time, such that if an operation from iteration zero is scheduled at time t, then for each subsequent iteration i, the same operation from iteration i is scheduled at time t + i·s, where s is a fixed value, called the initiation interval, that depends on the resource- and data-dependence structure of the loop body. In effect, modulo scheduling schedules a single iteration, and then repeats it every 'initiation interval' time steps until resource constraints prevent any further overlapping of iterations.
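The fixed-iteration rule can be sketched in a few lines. The sketch below assumes a single class of identical functional units and single-cycle operations, and the function names are illustrative; it is not the algorithm of [1], only the timing rule and the resulting steady-state resource check.

```python
from collections import Counter
from typing import Dict


def issue_time(t: int, i: int, s: int) -> int:
    """Time at which an operation scheduled at t in iteration 0 runs in iteration i."""
    return t + i * s


def fits_resources(iteration_schedule: Dict[str, int], units: int, s: int) -> bool:
    """Check a one-iteration schedule against `units` identical functional units.

    In the steady state, every operation issued at time t recurs every s
    steps, so operations whose times are congruent modulo s compete for the
    same units. Assumes single-cycle operations.
    """
    slot_usage = Counter(t % s for t in iteration_schedule.values())
    return all(count <= units for count in slot_usage.values())


schedule = {"a": 0, "b": 1, "c": 2, "d": 3}   # hypothetical one-iteration schedule
ok_with_s2 = fits_resources(schedule, units=2, s=2)   # slots 0 and 1 hold two ops each
ok_with_s1 = fits_resources(schedule, units=2, s=1)   # all four ops collide in slot 0
```

Here an initiation interval of 2 is feasible on two units, while an interval of 1 is not, which is the sense in which resource constraints bound how far iterations can be overlapped.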
The advantages of modulo scheduling are that it is simple, efficient and in some sense 'guided' by target resources (i.e. it continues to 'unroll' the fixed iteration until resource constraints prevent any further overlap). The disadvantages of modulo scheduling are twofold. First, if the loop contains conditional control flow, then it must be if-converted, whereby operations are replaced by guarded operations that depend on the predicates of the conditionals that control their execution, and the conditionals themselves are deleted. In this way, control dependencies are replaced by data dependencies, at the cost of forcing operations from disjoint control paths to compete for the same resources, even though in the final schedule they would never be executed simultaneously. The second disadvantage of modulo scheduling is simply the requirement that each iteration (from the original loop) in the pipelined loop have exactly the same schedule. This is not only a completely arbitrary restriction on the possible code motions, but also complicates or prevents the application of other transformations or optimizations that may be exposed during pipelining (e.g. load-after-store elimination, redundant operation removal, constant folding, copy propagation etc.).
Resource-constrained software pipelining (RCSP)
Resource-constrained software pipelining (RCSP) [7] is one of the more aggressive pipelining techniques and provides a foundation for discussing software pipelining in the presence of resource constraints. The algorithm consists of repeatedly unrolling and scheduling a loop until a repeating scheduling pattern emerges. This repeating pattern is then made the new (pipelined) loop body, and everything before and after it becomes pre- and post-loop code, respectively. In order to ensure that a repeating pattern will form, RCSP requires that the operations from each unrolled iteration be the same and be scheduled in the same order (though the resulting schedules for each unrolled iteration need not be the same).
The main advantages of RCSP are that it gracefully handles conditional control flow within the loop body (e.g. if-conversion is not necessary) and it is 'guided' by target resources in the sense that the degree of parallelism in the loop increases (via unrolling) until a repeating scheduling pattern forms. In fact, with respect to full unrolling of the loop, and its own self-imposed scheduling constraints (i.e. each iteration has the same operations scheduled in the same order), RCSP is provably optimal.⁷ The principal disadvantage of RCSP is that the built-in, systemic constraint of unrolling and scheduling until a pattern repeats often causes the loop to be unrolled many more times than is needed to fully utilize the available resources. Another disadvantage of RCSP is that the requirement that each iteration have exactly the same operations scheduled in the same order, though less severe than for modulo scheduling, still arbitrarily limits the kinds of transformations and optimizations that can be performed during scheduling.
Enhanced pipelined percolation scheduling (EPPS)
Enhanced pipelined percolation scheduling (EPPS) [8] adopts a very different approach to pipelining from modulo scheduling and RCSP. Like the other methods, EPPS is resource constrained; however, unlike the others, it is not guided by resources. Whereas modulo scheduling and RCSP are both capable of increasing the degree of parallelism of the loop (via unrolling) in response to resource availability, EPPS relies exclusively on loop 'shifting' (essentially moving operations across the backedge), thus exposing operations from successive iterations to be scheduled in parallel, but limiting the total pipelined loop body to a single iteration (of the original loop). EPPS continues to shift the loop until each instruction from the original loop body has been shifted.
The advantages of EPPS are that it is simple, efficient, gracefully handles conditional control flow and places no restrictions whatsoever on the transformations and optimizations that can be performed within the loop body. The main disadvantage of EPPS is that it arbitrarily limits the degree of parallelism to what is available in a single iteration of the original loop, regardless of how well or poorly it utilizes the available resources. Another disadvantage is that the termination condition is completely arbitrary, thus potentially causing more code explosion than necessary if too much shifting is done, or less performance than possible if too little shifting is done.
⁷ Of course, optimal with respect to full unrolling is not the same as absolute optimality; however, given the NP-hard nature of the problem and the results of [9], in practice, optimality with respect to unrolling may be the strongest form possible for general loop pipelining algorithms.
How RDLP relates
The resource-directed loop pipelining (RDLP) technique presented in this paper attempts to retain the advantages of all three of the above-mentioned techniques, while avoiding the limitations of any of them. Like modulo scheduling and RCSP, RDLP is 'guided' by target resources, so that the degree of parallelism can grow in proportion to available resources; like RCSP and EPPS, RDLP gracefully handles conditional control flow and, like EPPS, RDLP places no restrictions at all on what transformations and optimizations can be performed within the loop body. By explicitly separating the parallelism-exposing and parallelizing transformations from each other and from the heuristics that control them, RDLP does not suffer from any of the built-in, systemic constraints on functionality inherent to the other techniques. Note that RDLP is no more or less heuristic-based than any other (tractable) loop pipelining technique; the difference is that RDLP makes the heuristics explicit and tunable, rather than fixed and hidden within the mechanics of the algorithm.
THE ALGORITHM
This section describes the general RDLP technique, a specific implementation of which is then described in Section 4. Figure 4 shows the RDLP algorithm. RDLP starts by compacting the initial loop body, and then repeatedly refining that schedule until a desired cost versus performance goal is achieved, first by unrolling and compacting to increase the degree of parallelism within the loop, when needed, and then by shifting and compacting to eliminate any sequentiality that may still exist at the end-points of the loop.
Compaction (i.e. parallelization) of the loop body at each stage of the pipelining process is done using the COMPACT routine. One of the advantages of RDLP is that the method used for compacting the code is orthogonal to the pipelining algorithm itself: any technique can be used since DONE-UNROLLING and DONE-SHIFTING (see below) depend only on the current resource utilization of the loop and the cost accrued so far, without regard to the mechanism used in restructuring the loop body. Note that while it is possible to imagine a context in which the compiler writer may decide to base cost constraints on some aspect of the underlying system of code motion transformations, this is not required by RDLP itself, and simply represents yet another degree of freedom that RDLP makes available to the compiler writer. In the implementation of RDLP described in Section 4, the COMPACT routine is based on trailblazing percolation scheduling (TiPS) [2], a general and provably complete system of parallelizing transformations.
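The overall control structure described so far can be sketched as a small driver that takes the transformations and the termination heuristics as parameters. This is one reading of the prose description, not a transcription of Figure 4; the function `rdlp`, its parameter names and the toy loop model below are all assumptions made for illustration.

```python
from typing import Callable, Tuple, TypeVar

L = TypeVar("L")  # whatever representation the compiler uses for a loop


def rdlp(loop: L,
         compact: Callable[[L], L],
         unroll: Callable[[L], L],
         shift: Callable[[L], L],
         done_unrolling: Callable[[L], bool],
         done_shifting: Callable[[L], bool]) -> L:
    """Compact, then unroll-and-compact, then shift-and-compact.

    The heuristics (done_unrolling, done_shifting) are passed in separately
    from the transformations (compact, unroll, shift).
    """
    loop = compact(loop)
    while not done_unrolling(loop):
        loop = compact(unroll(loop))
    while not done_shifting(loop):
        loop = compact(shift(loop))
    return loop


# Toy stand-in for a loop: (operation_count, shifts_applied).
ToyLoop = Tuple[int, int]
OPS_PER_ITERATION, UNITS = 14, 4

result = rdlp(
    (OPS_PER_ITERATION, 0),
    compact=lambda l: l,                                   # no-op placeholder
    unroll=lambda l: (l[0] + OPS_PER_ITERATION, l[1]),
    shift=lambda l: (l[0], l[1] + 1),
    done_unrolling=lambda l: l[0] % UNITS == 0,            # ops divide evenly
    done_shifting=lambda l: l[1] >= 2,                     # arbitrary stand-in
)
```

Because the compaction routine is just another parameter, the driver itself does not change when a different scheduler is substituted, which is the orthogonality the text describes. In the toy run, the loop is unrolled once (14 to 28 operations) and shifted twice, mirroring the Livermore kernel 1 walkthrough.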
The UNROLL-COMPACT cycle
The UNROLL-COMPACT cycle is used to increase the number of operations in the loop, if needed, until resources can be maximally utilized by the available operations, or cost constraints are reached. The UNROLL routine is responsible for making a copy of the original (compacted) loop body at the end of the current partially pipelined loop (which may already contain multiple iterations from previous unrollings). Because we allow arbitrary optimizations to occur during compaction, possibly including very aggressive code transformations (e.g. mutations [6]), it is not possible to determine a priori how much unrolling will be needed. Therefore, the UNROLL-COMPACT cycle is done incrementally and terminates when DONE-UNROLLING returns true. DONE-UNROLLING is responsible for determining when enough operations have been exposed for parallelization to maximally utilize the available resources, or cost constraints have been reached. ENOUGH-OPERATIONS-IN-LOOP determines whether or not resources can be maximally utilized by the operations currently in the loop body.⁸
Specifically, ENOUGH-OPERATIONS-IN-LOOP behaves as follows:
Let k be the number of different kinds of resource (e.g. ALU, FALU etc.):
Note that in terms of resource constraints, multi-cycle operations are handled as in [7], by treating each k-cycle operation as a sequence of k one-cycle (stage-level) operations, and similarly treating each k-cycle functional unit as k one-cycle functional units (pipeline stages). This approach to representing resources, while simple, is still very robust. As seen in Section 5, this approach allows us to model heterogeneous register files and different numbers and kinds of functional units with varying latencies.
Since the heuristics that control RDLP depend on the context in which it is used, we will not attempt to give an exhaustive list of the possibilities. Rather, in this section, we will simply describe the three principal kinds of heuristics, and then in Section 4 we will describe one particular instantiation. For UNROLL-COST-CONSTRAINTS-ARE-REACHED, the following three classes of heuristics are applicable to most target architectures and application domains:
Code size constraints. Stop unrolling if the resulting code size would exceed a certain amount. The 'cut-off' size can be fixed for all loops or dependent on the characteristics of specific loops, such as loop bounds (possibly obtained via analysis or assertions) or the relative importance of the loop (obtained via profiling, assertions or analysis).
Compile time constraints. Only spend a certain amount of time unrolling. As for size constraints, this amount can be fixed for all loops or dependent on the characteristics of specific loops.
Performance threshold constraints. Stop unrolling after a certain goal resource utilization can be achieved (like testing for the ability to maximally utilize resources, but instead of testing for 100% utilization, check for some utilization X < 100%). The performance threshold can be fixed, or can depend on the relative importance of the loop, as for the previous constraints, or can change dynamically depending on the total cost accrued so far (e.g. initially try for maximal utilization, but then periodically decrease the goal utilization after a certain amount of code explosion and/or compile time).
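As an illustration of how these three classes might be combined, the following sketch bundles them into a single predicate. The class `UnrollBudget`, its field names and the use of integer percentages are assumptions chosen for the example; the text leaves the concrete form of the heuristics to the compiler writer.

```python
from dataclasses import dataclass


@dataclass
class UnrollBudget:
    """Hypothetical tuning knobs, one per class of heuristic."""
    max_code_size: int          # code size cut-off, in instructions
    max_seconds: float          # compile time allowed for unrolling
    goal_utilization_pct: int   # performance threshold, e.g. 100 or some X < 100


def unroll_cost_constraints_reached(code_size_after_unroll: int,
                                    elapsed_seconds: float,
                                    achievable_utilization_pct: int,
                                    budget: UnrollBudget) -> bool:
    """Return True if any of the three constraints says to stop unrolling."""
    too_big = code_size_after_unroll > budget.max_code_size
    too_slow = elapsed_seconds >= budget.max_seconds
    good_enough = achievable_utilization_pct >= budget.goal_utilization_pct
    return too_big or too_slow or good_enough


budget = UnrollBudget(max_code_size=64, max_seconds=2.0, goal_utilization_pct=90)
keep_going = not unroll_cost_constraints_reached(28, 0.1, 87, budget)
stop_on_size = unroll_cost_constraints_reached(70, 0.1, 87, budget)
```

A per-loop budget (for example, a larger `max_code_size` for loops that profiling marks as hot) would fit the same interface, since the predicate only sees the budget it is handed.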
The SHIFT-COMPACT cycle
The SHIFT-COMPACT cycle is used to schedule operations from successive iterations in parallel with preceding iterations, without increasing the number of iterations in the loop, until resources are maximally utilized or cost constraints have been reached. As for the UNROLL-COMPACT cycle, there is no way to tell beforehand how much shifting will be necessary since COMPACT has the freedom to perform other transformations and optimizations to the loop body that may have nothing to do (directly) with loop pipelining (e.g. load-after-store elimination). Therefore, the SHIFT-COMPACT cycle continues until DONE-SHIFTING returns true. DONE-SHIFTING is responsible for determining when resources are maximally utilized (using MAXIMAL-RESOURCE-UTILIZATION), or cost constraints have been reached (SHIFT-COST-CONSTRAINTS-ARE-REACHED). Using the same notation as above, MAXIMAL-RESOURCE-UTILIZATION returns true if for all paths p in the loop, at least one of the following is true:
1. some resource is fully utilized;
2. |p| = longest-cyclic-dep_p, i.e. a loop-carried dependence prevents further parallelization,
where |p| is the number of instructions on path p.
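The test above can be sketched as a predicate over paths. The sketch interprets 'some resource is fully utilized' as some kind of resource being 100% busy in every instruction on the path, and it takes the length of the longest cyclic dependence chain for each path as a precomputed input; both the interpretation and the data representation are assumptions, since the precise notation is not reproduced here.

```python
from typing import Dict, List

# An instruction maps each resource kind to the number of units it uses.
Instruction = Dict[str, int]
Path = List[Instruction]


def some_resource_fully_utilized(path: Path, available: Dict[str, int]) -> bool:
    """True if some kind of resource is 100% busy in every instruction on the path."""
    return any(
        all(instr.get(kind, 0) == units for instr in path)
        for kind, units in available.items()
    )


def maximal_resource_utilization(paths: List[Path],
                                 available: Dict[str, int],
                                 longest_cyclic_dep: List[int]) -> bool:
    """True if every path is either resource-bound or dependence-bound.

    longest_cyclic_dep[i] is assumed to be the length, in instructions, of
    the longest loop-carried dependence chain on paths[i].
    """
    return all(
        some_resource_fully_utilized(p, available) or len(p) == longest_cyclic_dep[i]
        for i, p in enumerate(paths)
    )


available = {"FU": 4, "BR": 2}
full_path = [{"FU": 4}, {"FU": 4, "BR": 1}]
partial_path = [{"FU": 4}, {"FU": 2}]
```

For instance, `full_path` satisfies the first condition because all four functional units are busy in both instructions, whereas `partial_path` passes only if its two instructions are already as short as its longest cyclic dependence allows.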
SHIFT-COST-CONSTRAINTS-ARE-REACHED mirrors UNROLL-COST-CONSTRAINTS-ARE-REACHED. The only difference is that shifting, rather than unrolling, is constrained.
IMPLEMENTATION
Resource-directed loop pipelining (RDLP) is a general loop pipelining technique that could be implemented as part of most compilers for fine-grain parallel architectures. This section describes one such implementation, as part of the EVE mutation scheduling compiler being developed at UCI.¹⁰ Figure 5 shows a structural overview of the EVE compiler. One of the main objectives of EVE is to provide a completely general and powerful system of transformations, controlled by an independent suite of heuristics that can be tuned for different application domains, target architectures and cost versus performance trade-offs. RDLP is at the highest level of this system of 'mutation scheduling' transformations, followed next by trailblazing percolation scheduling (TiPS) [2], a non-incremental system of code motion transformations that schedules operations within a hierarchical representation of the control flow graph, called the hierarchical task graph [11], and finally by the mutate transformation [6], a mechanism that allows the expression used in computing any value to change 'on the fly' during scheduling in response to changing resource constraints and availability. Mutation scheduling effectively integrates code selection, register allocation and instruction scheduling into a unified framework in which context-sensitive trade-offs can be made between the functional, register and memory bandwidth resources of the target architecture. Indeed, one of the main strengths of RDLP is that it allows arbitrary optimizations and transformations to be performed during the pipelining process, including very aggressive transformations such as code mutation, thus allowing RDLP to take advantage of opportunities for optimizations/transformations that are exposed by loop unrolling and/or shifting.¹¹
The current implementation of RDLP within EVE uses a few very simple cost constraints controlled by the following parameters, read as inputs by EVE: UNROLL LIMIT specifies the maximum amount of unrolling for any loop, SHIFT LIMIT specifies the maximum number of shift transformations to be applied to any loop, and SHIFT TRY LIMIT specifies the maximum number of shift transformations to be applied without improving the schedule (i.e. as long as shifting improves the schedule at least once each SHIFT TRY LIMIT shift transformations, RDLP will continue shifting until resources are maximally utilized or the SHIFT LIMIT is reached, whichever comes first). Note that for the results presented below, our focus was on highlighting the ability of RDLP to expose 'just enough' parallelism to maximally utilize resources, rather than on its equally important ability to be tuned for different cost versus performance trade-offs. Therefore the cost constraints were arbitrarily chosen to be large enough to have little impact on the results, i.e. in most cases, the pipelining terminated due to maximal resource utilization, rather than because cost constraints were reached. As will be shown below, even when given this ability to very aggressively exploit parallelism, RDLP still incurred a minimum of cost.
¹¹ A good example is load-after-store elimination, which is often enabled by both unrolling and shifting, such as when a load from iteration i simply reads a value that was stored in iteration i − 1 (in this case the load can usually be eliminated and each use of it can be replaced by a use of the value that was stored).
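The interaction between SHIFT LIMIT and SHIFT TRY LIMIT can be sketched as follows. The sketch assumes that 'improving the schedule' means reducing the schedule length, and it models the loop, the shift-and-compact step and the utilization test as caller-supplied functions; these modelling choices and all names are assumptions, not a description of EVE's code.

```python
from typing import Callable, Tuple, TypeVar

L = TypeVar("L")


def shift_compact_cycle(loop: L,
                        shift_and_compact: Callable[[L], L],
                        schedule_length: Callable[[L], int],
                        maximally_utilized: Callable[[L], bool],
                        shift_limit: int,
                        shift_try_limit: int) -> Tuple[L, int]:
    """Shift until resources are maximally utilized or a limit is hit.

    Returns the final loop and the number of shifts applied.
    """
    shifts = 0
    shifts_since_improvement = 0
    best = schedule_length(loop)
    while (not maximally_utilized(loop)
           and shifts < shift_limit
           and shifts_since_improvement < shift_try_limit):
        loop = shift_and_compact(loop)
        shifts += 1
        length = schedule_length(loop)
        if length < best:               # assumed meaning of "improves"
            best = length
            shifts_since_improvement = 0
        else:
            shifts_since_improvement += 1
    return loop, shifts


# Toy model: the loop is an index into a list of schedule lengths.
lengths = [9, 8, 8, 7, 7, 7, 7]
step = lambda i: min(i + 1, len(lengths) - 1)
length_of = lambda i: lengths[i]

stalled = shift_compact_cycle(0, step, length_of, lambda i: False, 10, 2)
capped = shift_compact_cycle(0, step, length_of, lambda i: False, 3, 2)
```

In the toy run, `stalled` stops after five shifts because the last two brought no improvement, while `capped` stops after three shifts because the overall limit is reached first.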
RESULTS
This section presents the results for two sets of experiments. In the first, we compare RDLP against resource-constrained software pipelining (RCSP) [7], arguably one of the most aggressive (albeit fairly expensive) pipelining techniques, and we find that RDLP achieves the same performance levels as RCSP, but at only a fraction of its cost. In the second set of experiments, we compare RDLP against an optimal (brute-force) loop pipelining algorithm and we find that not only is RDLP capable of achieving close to optimal performance levels, but it exposes just enough parallelism to do so, providing maximum performance for a minimum cost.
RDLP versus RCSP
The comparison of RDLP versus RCSP is made using two experiments in which we compare the performance (in terms of speedup, i.e. the ratio of sequential to parallel cycles observed during execution) and the cost (as code size). Figures 6 and 7 show the results for both experiments. The benchmarks are the Livermore kernels. The RCSP results are taken from [7] and were generated with an earlier version of the EVE compiler 12 . The only (relevant) difference between the earlier version and the current version of EVE is that the system of transformations for the earlier version consisted of RCSP and TiPS, whereas in the current version of EVE, the system of transformations consists of RDLP, TiPS and mutate. In order to make a fair comparison between RDLP and RCSP, the mutate transformation was not used in computing the RDLP results presented in this section.
The target architectures for both experiments are (simulated) VLIW architectures with pipelined functional units and a single, 64-bit wide, shared register file containing 32 registers. The execution semantics of the architectures are those of the IBM VLIW model [4]. The instruction set for each is essentially that of the MIPS and operation latencies are roughly the same as the Motorola 88110 superscalar: integer arithmetic, logical and shift operations take 3 cycles; floating-point add and multiply, and integer multiply operations take 5 cycles; floating-point and integer division take 15 cycles; memory reads take 2 cycles on a cache hit, and 1 cycle on a cache write (cache misses stall the processor); and evaluating branch conditions takes 4 cycles. In the first experiment (Figure 6) the target architecture has four homogeneous functional units, each capable of executing any operation. In the second experiment (Figure 7), the target architecture has heterogeneous functional units: two ALU units responsible for executing integer arithmetic and logical operations; two SHIFT units for executing shift operations; two FALU units for executing floating-point add and multiply, and integer multiply operations; two FDIV units responsible for executing floating-point division operations; two MEM units responsible for executing load and store operations; and two BRANCH units for executing conditional branches.
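To make the two machine descriptions concrete, the following sketch encodes them as data and checks whether a group of operations fits into a single VLIW instruction. The names `Target`, `LATENCY`, `fits_in_one_instruction` and the operation-class labels are hypothetical and do not come from EVE. Two readings are assumptions: '1 cycle on a cache write' is taken as the store latency, and integer division is left unmapped on the heterogeneous target because the text does not say which unit executes it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

# Operation latencies in cycles, as listed in the text.
LATENCY: Dict[str, int] = {
    "int_alu": 3, "shift": 3,
    "fp_add": 5, "fp_mul": 5, "int_mul": 5,
    "fp_div": 15, "int_div": 15,
    "load": 2,    # on a cache hit
    "store": 1,   # assumption: "1 cycle on a cache write"
    "branch": 4,
}


@dataclass(frozen=True)
class Target:
    name: str
    units: Dict[str, int]    # functional-unit kind -> number of units
    op_unit: Dict[str, str]  # operation class -> unit kind that executes it


# Experiment 1: four units, each able to execute any operation.
HOMOGENEOUS = Target("homogeneous", {"ANY": 4}, {op: "ANY" for op in LATENCY})

# Experiment 2: two units of each kind. "int_div" is deliberately
# absent because the text does not assign it to a unit.
HETEROGENEOUS = Target(
    "heterogeneous",
    {"ALU": 2, "SHIFT": 2, "FALU": 2, "FDIV": 2, "MEM": 2, "BRANCH": 2},
    {"int_alu": "ALU", "shift": "SHIFT",
     "fp_add": "FALU", "fp_mul": "FALU", "int_mul": "FALU",
     "fp_div": "FDIV", "load": "MEM", "store": "MEM", "branch": "BRANCH"},
)


def fits_in_one_instruction(target: Target, ops: Iterable[str]) -> bool:
    """True if every operation in ops can be issued in the same cycle."""
    demand = Counter(target.op_unit[op] for op in ops)
    return all(n <= target.units[kind] for kind, n in demand.items())
```

For example, three FALU-class operations fit in one instruction on the homogeneous target but not on the heterogeneous one, which has only two FALU units.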
For both experiments the following largely arbitrary control parameters were used: UNROLL LIMIT = 9, SHIFT LIMIT = 100 and SHIFT TRY LIMIT = 20. Table 1 shows the actual amount of unrolling/shifting that RDLP applied to each loop, for each experiment. Note that only inner loops are pipelined and kernels with multiple inner loops have one entry per loop.
[Table 1, surviving rows only; the column headings and the kernel label of the first row are missing: (?) 0 100 0 27; LL21 6 41 3 55; LL22 7 40 3 44; LL23 0 33 0 32; LL24 0 100 0 100]
Notice that in only one case (Homogeneous/LL8) was the UNROLL LIMIT reached, and with few exceptions (LL15-17, 20, 24), shifting terminated long before the SHIFT LIMIT was reached. Thus, on average, pipelining converged due to maximal resource utilization, rather than cost constraints.
On average RDLP achieves roughly the same speedup as RCSP: in Figure 6, RDLP wins by about 6%, and in Figure 7, it loses by about the same amount. However, in terms of code size, RDLP is four times more efficient than RCSP on average, and in some cases (e.g. kernels 7 and 9 in Figure 6) it provides an order of magnitude decrease in code size, while giving up only 10-13% in terms of speedup. The ability to schedule code freely, without the self-imposed constraints used by RCSP (and other techniques) to force/recognize convergence, in some cases allows RDLP to achieve even greater speedups than RCSP while remaining much more efficient in terms of code size (e.g. kernels 16 and 18 in both experiments). Given the aggressiveness with which RCSP parallelizes code, it is no surprise that, in a few cases, RCSP obtains higher speedups than RDLP (e.g. kernels 7 and 9 in Figure 7); however, we anticipate that with better heuristics, even this small concession to RCSP can be avoided.
In any case, even the very simple and naive heuristics used in these experiments allowed RDLP to achieve an average speedup within ±6% of the RCSP speedup, while providing a more than fourfold decrease in code size (and a commensurate decrease in compile time). This demonstrates the importance of using target resources to guide the parallelization process.
RDLP versus optimal
In order to highlight the ability of RDLP to expose just enough parallelism to maximally exploit the resources of fine-grain parallel architectures, we present the results of six more experiments in which we compare the performance (in terms of speedup) and the cost (in terms of the number of unroll and shift transformations) of RDLP against results that are optimal with respect to the capabilities of the underlying system of code motion transformations. The 'optimal' results were produced as follows. We choose a maximum amount of unrolling to try, say U, and a maximum amount of shifting, say S, and then maximally compact the code for each possible combination of unroll and shift transformations within these limits (i.e. for each (x, y) ∈ {0 . . . U} × {0 . . . S}, unroll and compact x times, and then shift and compact y times). We try progressively larger values of U and S until performance ceases to improve. For the experiments presented in this paper, this happened at U ≤ 10 and S ≤ 20.
Figure 8 shows the results of our experiments. This figure contains three graphs that compare performance, in terms of average speedup, and cost, in terms of the amounts of unrolling and shifting, of RDLP against those of the 'optimal' algorithm for each of six different target architectures with varying numbers and kinds of functional unit. For each target architecture, the average speedup and amounts of unrolling and shifting are over 14 benchmarks taken from the Livermore kernels: hydro, diffprod, firstdiff, ICCG, matrixmul, planckdist, 2dhydro, innerprod, BLE, tridiag, recurrence, eqofstate, ADI, and integrate. The target architectures used in all six experiments are simulated, unicycle VLIWs with three-way branching (i.e. up to two conditional branches per instruction) and 32 integer and 32 floating-point registers. Each target uses the same instruction set, which is essentially that of the MIPS architecture.
The first three targets listed in each graph have two, three or four, respectively, of each kind of heterogeneous functional unit (FU): ALU, SHIFT, FALU, FMUL, FDIV and MEM. The last three targets listed in each graph have two, four or eight homogeneous FUs, respectively (i.e. each FU is capable of executing any operation).
The following control parameters were used in all six experiments: UNROLL LIMIT = 20, SHIFT LIMIT = 40, and SHIFT TRY LIMIT = 4. Note that in these experiments RDLP was able to expose 'enough' operations with at most three unrollings and 14 shifts, so any UNROLL LIMIT ≥ 3 and SHIFT LIMIT ≥ 14 would have yielded the same results.
For the optimal results, the amount of unrolling and shifting is the minimum amount for which the performance was maximized. Notice that on average RDLP achieves within 3% of the optimal speedup, and does so with approximately one third of the unrolling that would be required for achieving maximum performance. RDLP does, however, require more shifting, but in general this is not very significant since each shift transformation only increases the code size by one instruction. Note that even though the six target architectures used in these experiments have very different numbers/kinds of resources, there is little deviation between the ratios of RDLP performance (and cost) to the optimal performance (and cost): across all six experiments, RDLP comes within 1% (the 'two FUs' experiment) to 5% ('two of each') of the optimal speedup, usually with significantly less code explosion than would be required to achieve the last few per cent in performance. This fact highlights two key features of RDLP: its ability to adapt to different target architectures, and its ability to expose enough parallelism, but no more than is necessary, to effectively utilize the resources of the specific target.
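The exhaustive procedure used to obtain the 'optimal' results in this subsection can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: `cycles(x, y)` is a hypothetical hook standing in for 'unroll and compact x times, then shift and compact y times, and measure the parallel cycle count'. Ties are broken by least unrolling and then least shifting, and the bounds are widened by fixed steps. Both rules are choices made for this sketch, since the text only states that the minimum amount achieving maximum performance is reported.

```python
from itertools import product
from typing import Callable, Tuple

Result = Tuple[int, int, int]  # (unrolls, shifts, cycles)


def exhaustive_search(cycles: Callable[[int, int], int],
                      max_unroll: int, max_shift: int) -> Result:
    """Try every (x, y) in {0..U} x {0..S}; keep the cheapest best point."""
    best_key = None
    for x, y in product(range(max_unroll + 1), range(max_shift + 1)):
        # Lexicographic order: fewest cycles, then least unrolling,
        # then least shifting (tie-breaking order is an assumption).
        key = (cycles(x, y), x, y)
        if best_key is None or key < best_key:
            best_key = key
    c, x, y = best_key
    return x, y, c


def widen_until_no_gain(cycles: Callable[[int, int], int],
                        u: int = 1, s: int = 1,
                        du: int = 1, ds: int = 2) -> Result:
    """Grow U and S until a larger search space stops improving the result."""
    best = exhaustive_search(cycles, u, s)
    while True:
        candidate = exhaustive_search(cycles, u + du, s + ds)
        if candidate[2] >= best[2]:
            # Note: stopping at the first non-improving step can miss
            # gains that only appear after a plateau.
            return best
        u, s, best = u + du, s + ds, candidate


def toy_cycles(x: int, y: int) -> int:
    # Made-up cost model for demonstration only; not measured data.
    return max(4, 20 - 3 * x - y)
```

Calling `exhaustive_search(toy_cycles, 10, 20)` on the toy model returns the point with the fewest cycles that needs the least unrolling. The cost of the search grows with (U + 1)(S + 1) full compactions, which is why it serves here only as a reference point for RDLP.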
