As technology has advanced, the application space of Very Long Instruction Word (VLIW) processors has grown to include a variety of embedded platforms. Due to cost and power consumption constraints, many embedded VLIW processors contain limited resources, including registers. As a result, a VLIW compiler that maximizes instruction level parallelism (ILP) without considering register constraints may generate excessive register spills, leading to reduced overall system performance. To address this issue, this article presents a new spill reduction technique that improves VLIW runtime performance by reordering operations prior to register allocation and instruction scheduling. Unlike earlier algorithms, our approach explicitly considers both register reduction and data dependency in performing operation reordering. Data dependency control limits unexpected schedule length increases during subsequent instruction scheduling. Our technique has been implemented in Trimaran, an academic VLIW compiler, and evaluated using a set of embedded systems benchmarks. Experimental results show that, on average, this technique improves VLIW performance by 10% for VLIW processors with 32 registers and 8 functional units compared with previous spill reduction techniques. Limited improvement is seen versus prior approaches for VLIW processors with 64 registers and 8 functional units.
INTRODUCTION
Very Long Instruction Word (VLIW) processors are currently used in a variety of embedded systems that require high performance within constrained operating conditions [Goossens et al. 1997; Faraboschi et al. 2000]. In an effort to minimize hardware, typical VLIW processors do not provide specialized hardware to support dynamic scheduling or out-of-order execution [Hennessy and Patterson 1996]. Rather, compile time scheduling is used to determine a fixed schedule for multiple operations performed in parallel on VLIW functional units. As a result, the runtime performance of a VLIW processor depends heavily on the efficiency of its compiler.
A typical VLIW compilation flow includes several optimization phases, including instruction scheduling and register allocation [Freudenberger and Ruttenberg 1991] . Instruction scheduling attempts to maximize instruction level parallelism (ILP) by scheduling as many operations as possible in parallel, which may require a large number of registers to hold variables generated in the schedule. Register allocation assigns a physical register to each variable. If a given schedule requires more registers than available physical registers, variables must be spilled to memory, regardless of the register allocation algorithm that is used. Given the latencies involved in main memory access, spills may result in significantly reduced system performance.
Due to cost and power consumption constraints of embedded systems, many embedded VLIW processors only contain a limited number of physical registers, exacerbating the possibility of memory spills. For example, the Freescale MSC8101 [Freescale Semiconductor, Inc. 2005] and TI C62x [Texas Instruments, Inc. 2000] VLIW processors contain 16 and 32 physical registers, respectively. In these situations, if a VLIW compiler simply maximizes ILP without considering register constraints, a substantial number of spills may be generated. However, simply minimizing spills may not always improve performance if application parallelism is negatively affected. In this article, we show that both of these issues must be taken into account during the early stages of compilation to achieve the best possible application performance.
A number of spill reduction techniques [Kim 2001; Govindarajan et al. 2003; Touati 2005], including register pressure control [Touati 2005], have been developed. As shown in Figure 1, register pressure control is applied before instruction scheduling and register allocation. Using data dependencies, the measure step estimates the maximum register requirement (Max reg) of all possible straightline code schedules. If Max reg exceeds the number of available physical registers, Phy reg, a reduction step is used to reduce Max reg to Phy reg. The reduction step allows subsequent instruction scheduling to focus on improving parallelism.
In this article, we present a new register pressure control technique based on the reordering of operations. Our technique reorders operations to reduce the number of required registers and spills while attempting to maintain instruction level parallelism. To demonstrate its benefit, we have integrated this algorithm into an academic VLIW compiler, Trimaran [Chakrapani et al. 2004] . As shown in Section 6, the algorithm can achieve at least a 10% reduction in benchmark execution time compared with previously published approaches [Kim 2001; Govindarajan et al. 2003; Touati 2005 ] for a VLIW architecture with 32 registers and 8 functional units at the cost of a modest compile time increase. For 64-register VLIW architectures with lower register pressure, all approaches perform roughly equally.
The remainder of this article is organized as follows. In Section 2, a brief discussion of previous work is presented. Section 3 provides background information and discusses the limitations of previous techniques. Section 4 presents a heuristic called Tetris that focuses on reducing the maximum register requirement (Max reg) without considering data dependencies. In Section 5, an extension of Tetris, called Tetris-XL, is presented. This algorithm considers both register reduction and data dependencies. Our experimental approach and results are presented in Section 6. Section 7 provides a summary of this work.
RELATED WORK
Previous work in spill code reduction can be put into several categories: general register allocation, schedule-sensitive register allocation, register-sensitive instruction scheduling, integrated register allocation and instruction scheduling, and register pressure control.
General register allocation aims to minimize the number of spills based on a given schedule. Graph coloring-based register allocation [Chaitin 1982; Briggs et al. 1989; Briggs 1992; Bouchez et al. 2007] is one of the most effective register allocation approaches. Graph-based algorithms build an interference graph based on a given schedule, in which each node represents a variable and an edge between two nodes indicates that two variables cannot share the same physical register. If there are not enough physical registers to hold all variables in the interference graph, spill code must be inserted to transfer some variable storage to memory.

In the Register on Demand (RoD) algorithm [Cilio and Corporaal 1999], an operation is scheduled only if a free register and functional unit are present. Register assignment or spill insertion is done on the fly. Because spill insertion is based on the instantaneous register and schedule requirements, it may be difficult to consider the effect of operation interdependencies. A second integrated technique [Zeitlhofer and Wess 2003] ensures that only schedules that satisfy all register constraints are generated. As a result, final register assignment is guaranteed to be successful. Our new approach attempts to address performance by considering an assessment of operation orderings while leaving specific register allocation and scheduling to later in the compilation flow.
Unlike the above approaches, register pressure control has more freedom in moving operations to reduce spills. Berson presented a technique called unified resource allocation (URSA) to reduce register pressure so that the schedule generated in subsequent instruction scheduling does not overuse registers [Berson et al. 1993, 1998]. This measure-and-reduce methodology is shown in Figure 1. Experimental results [Berson et al. 1998] show that register pressure control outperforms both schedule-sensitive register allocation (PIG) and register-sensitive instruction scheduling (IPS). Based on Berson's measure-and-reduce approach, Touati presented several new heuristics to further improve register pressure control [Touati 2001, 2005]. These extensions are discussed in Section 3. In general, previous register pressure control approaches only consider the movement of an individual operation at a time. In situations with high ILP, it is more beneficial to consider groups rather than individual operations [Berson et al. 1998]. To address this limitation, our heuristics consider the effect of moving a set of operations together, which allows for a better register reduction.
The work described in this article extends an earlier version of our previous register pressure control algorithm [Xu and Tessier 2007] in several important ways. As described in Xu and Tessier [2007] , our earlier algorithm only considers register reduction when performing operation reordering. Although this approach aggressively reduces required register count and associated spills, generated data dependencies may lead to reduced instruction level parallelism. This issue may lead to increased application cycle counts as a result of longer instruction schedules. This article also provides a comparison of our register pressure control algorithms to results generated by MRIS [Govindarajan et al. 2003 ], an early register allocation technique. This comparison was missing from our earlier work.
BACKGROUND
In this section, we first present several basic definitions. After discussing the limitation of previous techniques [Touati 2001, 2005], our performance-enhancement algorithm is presented.
As shown in Figure 2(a), data dependencies of the input code can be represented in a data dependency graph (DDG), G(V, E). A DDG contains a set of nodes V and a set of directed edges E = {(u, v) : u, v ∈ V}. A node u ∈ V represents an operation that defines variable u. A directed edge (u, v) ∈ E represents a data flow, where node v uses the variable defined by node u.
The delay of node u is equal to d(u) clock cycles. Node u reads registers on the first cycle of d(u) and writes the register on the last cycle. Our heuristics consider both unit and multicycle delay operations. However, for demonstration purposes, subsequent examples shown in the figures assume all nodes have unit delay, d(u) = 1. Additional terms are defined as follows.
-Pred(u) = {v ∈ V : (v, u) ∈ E} is the set of predecessor nodes required by u. Figure 2(a) shows that Pred(E) = {B, C, D}.
-Succ(u) = {v ∈ V : (u, v) ∈ E} is the set of successor nodes that use u as an input. Figure 2(a) shows that Succ(A) = {B, C, D}.
-Use(u) ∈ Succ(u). A node v, which is a Use(u), uses variable u as an input.
-Lv(u) is the live range of variable u. Lv(u) is the distance from node u to the last Use(u). As shown in Figure 2(a), live range Lv(B) is from node B to node E, the only node that uses variable B. Live ranges are independent of original statement ordering and are only dependent on variable definition and use nodes.
-An excessive set (ES) is a maximal set of nodes (variables) in a DDG that can be alive simultaneously such that the size of the set, Max reg, exceeds the number of available physical registers, Phy reg. Formally, ES is the maximal set of nodes v ∈ V that satisfies the following conditions:
Multiple excessive sets may exist for a given DDG.
For a given DDG, the maximum register requirement is the largest number of variables that are alive simultaneously. Since the instruction schedule is not fixed until the instruction scheduling phase, the maximum register requirement is estimated using the data dependencies of straightline code. In Berson et al. [1993] , it was shown that the maximum register requirement of a given DDG can be estimated by applying a minimum chain decomposition based on the Dilworth algorithm [Dilworth 1950] . In this article, we use an improved estimation technique called register saturation [Touati 2005 ]. Previous results show that the estimated result is within one register of the measured maximum register requirement [Touati 2005] .
Using register saturation [Touati 2005], the maximum register requirement of the DDG in Figure 2(a) is 5. As shown in Figure 2(b), if E is scheduled last, 5 variables {B, C, D, F, G} are alive simultaneously, since variables {B, C, D} are required by node E and {F, G} are output variables. If there are fewer than 5 physical registers, {B, C, D, F, G} becomes an excessive set, ES.
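The requirement of 5 can be checked by brute force on a graph this small. The sketch below is not register saturation [Touati 2005]; it simply enumerates every sequential ordering that respects the dependencies and counts the variables alive after each operation. The edge list is inferred from the text's description of Figure 2(a), and the names FIG2A and max_live are hypothetical.

```python
from itertools import permutations

# Data edges (u, v): node v uses the variable defined by node u.
# Assumption: edge list inferred from the description of Figure 2(a).
FIG2A = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "E"), ("C", "E"), ("D", "E"),
         ("C", "F"), ("D", "F"), ("D", "G")]


def max_live(data_edges, serial_edges=()):
    """Largest number of simultaneously live variables over all sequential
    orderings that respect data and serial edges (exponential, tiny DDGs only)."""
    nodes = sorted({n for edge in data_edges for n in edge})
    uses = {n: {v for (u, v) in data_edges if u == n} for n in nodes}
    ordering_edges = list(data_edges) + list(serial_edges)
    best = 0
    for order in permutations(nodes):
        pos = {n: i for i, n in enumerate(order)}
        if any(pos[u] > pos[v] for (u, v) in ordering_edges):
            continue  # this ordering violates a dependency
        done = set()
        for n in order:
            done.add(n)
            # Live: an output variable (no uses) or one with a pending use.
            live = [v for v in done if not uses[v] or not uses[v] <= done]
            best = max(best, len(live))
    return best


print(max_live(FIG2A))
```

Serial edges can be passed as the second argument to see how an added ordering constraint changes the result; they restrict orderings but do not count as uses.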
To reduce the size of the excessive set, live ranges of variables in the excessive set must be separated so that they do not overlap. In general, separating the live ranges of two variables u and v can be achieved by serialization (u → v) or serialization (v → u) , which is defined in the following text.
-A serial edge (w to v) is a directed edge from w to v. Serial edge (w to v) enforces an ordering such that variable v cannot be written before all inputs to node w have been read.
-Serialization (u → v) enforces an ordering such that Lv(v) begins after Lv(u) ends. Formally, serialization (u → v) creates serial edges (w to v), where w is a Use(u).
It has been proven [Touati 2005 ] that minimizing the excessive set size via serialization is NP-hard. To address this problem, Touati [2005] presented a greedy serialization technique that evaluates all possible serializations between any pair of variables and selects the one that can best reduce the maximum register requirement while increasing the critical path the least.
To reduce the excessive set {B, C, D, F, G} in Figure 2(b), greedy serialization selects serialization (D → F), as shown in Figure 2(c). To force an ordering so that F cannot be scheduled earlier than D, two serial edges (E to F) and (G to F) are inserted into the original DDG. After applying serialization (D → F), Max reg of the augmented DDG in Figure 2(c) is reduced from 5 to 4. The new excessive set is {E, C, D, G}, where {C, D} are required by F and {E, G} are output variables. Due to serial edges (E to F) and (G to F), the critical path changes from A-B-E to A-B-E-F.
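The serial edges implied by a single serialization can be derived mechanically from the DDG. The following minimal sketch assumes that Use(u) coincides with the data successors of u and that the target v itself never receives a self edge; the edge list is inferred from the description of Figure 2(a), and the function names are hypothetical.

```python
# Assumption: edge list inferred from the description of Figure 2(a).
FIG2A = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "E"), ("C", "E"), ("D", "E"),
         ("C", "F"), ("D", "F"), ("D", "G")]


def uses(data_edges, u):
    """Nodes that read the variable defined by u."""
    return {dst for (src, dst) in data_edges if src == u}


def serial_edges(data_edges, u, v):
    """Serial edges (w, v) for serialization (u -> v): one per Use(u) except v."""
    return {(w, v) for w in uses(data_edges, u) if w != v}


print(sorted(serial_edges(FIG2A, "D", "F")))
```

For serialization (D → F) this yields the two serial edges (E to F) and (G to F) named in the text.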
A limitation of greedy serialization is that only a single serialization is considered by the algorithm at a time, limiting trade-offs across multiple potential serializations. This greedy behavior often leads to poor performance. Figure 2 (c) shows that serialization (D → F ) requires a serial edge (G to F ). Additionally, serialization (D → G) requires a serial edge (F to G). Applying both serializations causes a cycle between F and G, which makes scheduling impossible. Therefore, the excessive set {E, C, D, G} in the augmented DDG can no longer be reduced because serialization {D → F } prevents other serializations.
To address this problem, a better reduction can be achieved by considering multiple variable serializations simultaneously, serializations (set1 → set2), which is defined in the following text.
-AG(V, E, SE) is an augmented version of graph G(V, E), which includes serial edges SE. To allow for a feasible scheduling, AG(V, E, SE) must contain no cycles.
-Two serializations are compatible if they can be applied together without creating a cycle in AG(V, E, SE).
As shown in Figure 2( d) , serializations ({B, C, D} → {F, G}) contain five compatible serializations: {B → F }, {C → F }, {D → G}, {C → G}, and {B → G}. Serialization {D → F } is not selected because it is not compatible with serializations {D → G} and {C → G}. A detailed discussion regarding compatibility checking is presented in Section 4.3. By applying these five compatible serializations, the maximum register requirement is reduced from five to three, which is one register less than the value achieved by greedy serialization. The new maximal set in the augmented DDG is {B, C, D}, in which all variables are required by E.
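Compatibility can be tested directly from its definition: apply the serial edges of all chosen serializations and check that the augmented graph is still acyclic. The sketch below does this with Python's standard graphlib module. The edge list is inferred from the description of Figure 2(a), Use(u) is assumed to equal the data successors of u, and the function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Assumption: edge list inferred from the description of Figure 2(a).
FIG2A = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "E"), ("C", "E"), ("D", "E"),
         ("C", "F"), ("D", "F"), ("D", "G")]


def serial_edges(data_edges, u, v):
    """Serial edges (w, v) for serialization (u -> v), one per Use(u) except v."""
    return {(w, v) for (src, w) in data_edges if src == u and w != v}


def is_schedulable(data_edges, serializations):
    """True if the DDG augmented with all serial edges contains no cycle."""
    edges = set(data_edges)
    for (u, v) in serializations:
        edges |= serial_edges(data_edges, u, v)
    preds = {}
    for (src, dst) in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    try:
        TopologicalSorter(preds).prepare()  # raises CycleError on a cycle
    except CycleError:
        return False
    return True


five = [("B", "F"), ("C", "F"), ("D", "G"), ("C", "G"), ("B", "G")]
print(is_schedulable(FIG2A, five))
print(is_schedulable(FIG2A, [("D", "F"), ("D", "G")]))
```

The five serializations selected in Figure 2(d) leave the graph acyclic, while adding (D → F) to (D → G) introduces the F/G cycle described above.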
In order to select and serialize multiple variables simultaneously for the best reduction, we present a new reduction technique, called Tetris, in the next section.
TETRIS REDUCTION

Overview
Tetris reduction was first presented in Xu and Tessier [2007] . The basic idea of Tetris reduction originates from the popular computer puzzle game. In a Tetris game, players try to move and place given random blocks to fit into a fixed width constraint. Similarly, Tetris reduction tries to identify blocks (subset of variables) with suitable topologies and move them to reduce the size of the excessive set from Max reg to Phy reg (fixed width). The similarity can be observed in Figure 3 , which uses the example in Figure 2 (d) .
As shown in Figure 3 , Tetris reduction includes two steps, partitioning and serialization.
(1) Partitioning. This step identifies candidates for serialization. Variables in the excessive set are partitioned into two subsets, E0 and E1, based on two criteria. The first criterion indicates whether variable serializations from E0 to E1 are possible. The second criterion indicates how much register count reduction can be achieved. A detailed description of the partitioning algorithm is presented in Section 4.2.
(2) Serialization. This step is applied to serialize variables in E1 after variables in E0 by inserting serial edges into the DDG. The ordering of variable serializations is decided such that a maximal set of variable serializations can be applied. The detailed serialization algorithm is discussed in Section 4.3.
Each code block may be subjected to multiple iterations of the Tetris algorithm until further improvement across all excessive sets is impossible.
Partitioning
4.2.1 Definitions. Before discussing partitioning in detail, additional definitions are presented.
-NSE(u,v) is a directed nonserializable edge (NSE) from variable u to v. This edge indicates that serialization (u → v) cannot be applied due to a path from v to at least one node, which is a Use(u). To maintain correct computation, both data and control dependencies are evaluated to generate NSEs. As shown in Figure 6(a), serialization (B → C) is not possible since there is a path from C to E and E is a Use(B). A detailed discussion of NSE checking is provided in Section 4.2.3.
-Bidirectional NSE(u,v) indicates that there is an NSE in both directions, (u, v) and (v, u). As shown in Figure 6(b), the live ranges of variable B and variable C cannot be separated by any serialization due to a bidirectional NSE(B, C).
-An NSE clique includes a set of variables. Each pair of variables has a bidirectional NSE so that all variables in the clique must be alive simultaneously. Formally, this relationship can be stated as {u, v ∈ ES : ∀u, v, ∃ NSE(u, v) and NSE(v, u)}. There are three NSE cliques, {B, C, D}, {F, G}, and {I}, shown in the example in Figure 6(c). A single variable is a degenerate case of an NSE clique.
-Partition(E0,E1) represents a bipartitioning of the excessive set ES such that E0 ∩ E1 = ∅ and E0 ∪ E1 = ES. In Figure 6(d), the excessive set is partitioned into E0 = {I}, E1 = {B, C, D, F, G}.
-Pred set(E1) is the set of nodes of Pred(u), where u is a variable in E1, that are not in the excessive set. In Figure 7(a), Pred set(E1) = {A}.
-Succ set(E0) is the set of nodes of Succ(u), where u is a variable in E0, that are not in the excessive set. In Figure 7(a), Succ set(E0) = {J}.
To identify candidates for serialization, a two-step partitioning algorithm was developed to search for a partition (E0, E1) that achieves the best reduction. The first step is coarsening where variables are merged into two partitions. To improve the partition quality, the second step, refinement, is applied to minimize the partition cost by moving variables between the two partitions. The partition cost is evaluated during these two phases by examining relevant cost metrics.
-Coarsening Cost Metric. This metric evaluates the number of possible variable serializations from E0 to E1. To maximize serializations, nonserializable edges (NSE) from E0 to E1 should be minimized and variables in an NSE clique should stay in the same partition. Based on this metric, the partition in Figure 6(d) is feasible since there is no directed NSE from E0 to E1.
-Refinement Cost Metric. This metric evaluates whether the topology of a partition (E0, E1) can lead to register reduction. Since variables in Succ set(E0) can be simultaneously alive with E1 after serialization, preferably |Succ set(E0)| < |E0|. Similarly, Pred set(E1) may be alive with E0 after serialization, so preferably |Pred set(E1)| < |E1|. Based on this criterion, the partitioning in Figure 7(a) is not a good candidate since |Succ set(E0)| = |E0| = 1.
4.2.2 Prepartitioning: NSE Construction. Prior to the two partitioning steps, nonserializable edges between variables must be identified. As shown in Figure 4, NSEs are created for variables that are code segment inputs (outputs), since they cannot be serialized after (before) other ES variables. Additionally, a breadth first search is performed to locate nodes that are successors of one variable in ES and descendants of another variable in ES. These ES variables also require NSEs because they cannot be serialized.
4.2.3 Partitioning: Coarsening. The main goal of the coarsening step is to minimize the number of nonserializable edges (NSE) from E0 to E1 based on the data dependencies of the DDG. All variable pairs in the excessive set are evaluated to check whether NSE edges should be inserted. If variable u and variable v have a bidirectional NSE between them, then they should be merged into the same partition. Therefore, the first step of coarsening is to create NSE cliques.
In general, if a set of variables is used by an operation, the variables in the set must be alive simultaneously, forming an NSE clique. Based on this rule, NSE cliques can be generated by a backward graph traversal. As shown in Figure 6 (c), NSE clique {F, G} is created first because output variables in the excessive set cannot be serialized. As a result of a backward traversal, four additional candidate NSE cliques are generated ({B, C, D}, {C, D}, {D}, and {I }). They are required by operations E, F , G, and J , respectively. Subsequently, the two largest nonoverlapping NSE cliques, {B, C, D} and {I }, are selected from the candidates. Detailed steps used for NSE clique generation are presented in Figure 5 .
After NSE clique generation, the coarsening step merges two NSE cliques together based on the number of NSE edges between them. As shown in 
4.2.4 Partitioning: Refinement. To improve the partition quality, the refinement step moves variables between two partitions. The partition quality is evaluated based on the partition cost, P cost, which is defined in the following text:

P cost = C reg + C NSE.

The C reg term represents the nonnegative gap between the maximum register requirement, Max reg, and the available physical registers, Phy reg.

C reg = Max((Max reg − Phy reg), 0).
The expected Max reg after serialization from E0 to E1 is calculated based on the topology of E0 and E1.
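Max reg = Max((|E0| + |Pred set(E1)|), (|E1| + |Succ set(E0)|)). (3)

11:12 • W. Xu and R. Tessier

As shown in Figure 7(a), the example partition has the following topology: E0 = {I}, E1 = {B, C, D, F, G}, Pred set(E1) = {A}, and Succ set(E0) = {J}, so that Max reg = Max((1 + 1), (5 + 1)) = 6.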
Term C NSE includes two parts, as shown in Eq. (4). The first part, NSE(E0, E1), is the number of directed NSEs from E0 to E1. The smaller the value of NSE(E0, E1), the more variable serializations can be achieved from E0 to E1. To estimate the effect of NSEs on the achievable register reduction, a scalar factor α = 1/Max(|E0|, |E1|) is applied.
The second part, which is based on NSE(E0) and NSE(E1), is the number of directional NSE between NSE cliques in E0 and E1, respectively. This part is only effective when C reg is positive, indicating another round of reduction is required to avoid spills. To allow for further reduction, NSE(E0) and NSE(E1) should also be minimized. To estimate the effect of NSE(E0) and NSE(E1), scalar factors β 0 = (0.1 × C reg )/|E0| and β 1 = (0.1 × C reg )/|E1| are applied. The denominators in α, β 0 , and β 1 indicate a bias toward unbalanced partition sizes. Large |E0| or |E1| sets can be more easily reduced in later iterations of the Tetris algorithm. The 0.1 factor was determined via experimentation. The value indicates the relative importance of interpartition versus intrapartition NSEs.
For the initial partition shown in Figure 7 (a), C reg = 6 − 4 = 2, C NSE = β 1 ×NSE(E1) = 0.24, and the initial cost value P cost init = C reg +C NSE = 2.24. Note that NSE(E0) is 0, since there are no NSEs in E0 and NSE(E0,E1) is 0, since there are no NSEs between E0 and E1. To minimize the partition cost, a refinement step uncoarsens partitions and randomly moves NSE cliques from E1 to E0, then E0 to E1. In general, if the new partition cost, P cost new, is smaller than the initial partition cost, P cost init, then the move is accepted. The partition snapshot with the smallest P cost is recorded and chosen as the final partition. Detailed steps used for refinement are presented in Figure 9 .
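The cost arithmetic for this example can be reproduced with a few lines of code. The sketch below assumes that Eq. (4) has the form C NSE = α × NSE(E0, E1) + β0 × NSE(E0) + β1 × NSE(E1), which is consistent with the description above but is an assumption. The NSE counts are passed in as plain numbers (NSE(E1) = 6 is inferred from β1 × NSE(E1) = 0.24), Phy reg = 4 is inferred from C reg = 6 − 4, both partitions are assumed to be nonempty, and the function name partition_cost is hypothetical.

```python
def partition_cost(e0, e1, pred_set_e1, succ_set_e0, phy_reg,
                   nse_e0_e1, nse_e0, nse_e1):
    """P_cost = C_reg + C_NSE for a bipartition (E0, E1) of an excessive set."""
    # Eq. (3): expected Max_reg after serializing E1 after E0.
    max_reg = max(len(e0) + len(pred_set_e1), len(e1) + len(succ_set_e0))
    c_reg = max(max_reg - phy_reg, 0)
    alpha = 1 / max(len(e0), len(e1))
    beta0 = 0.1 * c_reg / len(e0)
    beta1 = 0.1 * c_reg / len(e1)
    # Assumed form of Eq. (4); the intrapartition terms vanish when C_reg is 0.
    c_nse = alpha * nse_e0_e1 + beta0 * nse_e0 + beta1 * nse_e1
    return c_reg + c_nse


# Initial partition of Figure 7(a).
initial = partition_cost({"I"}, {"B", "C", "D", "F", "G"}, {"A"}, {"J"},
                         phy_reg=4, nse_e0_e1=0, nse_e0=0, nse_e1=6)
# After moving clique {B, C, D} from E1 to E0, as in Figure 7(b).
moved = partition_cost({"I", "B", "C", "D"}, {"F", "G"}, set(), {"E", "J"},
                       phy_reg=4, nse_e0_e1=0, nse_e0=0, nse_e1=0)
print(round(initial, 2), moved)
```

Scaling the intrapartition terms by C reg means they only influence the choice while another reduction round is still needed, which matches the behavior described for the second part of C NSE.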
As shown in Figure 7 (a), partition E1 contains two NSE cliques, {B, C, D} and {F, G}. Partition E0 contains only one NSE clique, {I }. The refinement of this example is described in the following text:
(1) Move {B, C, D} from E1 to E0, as shown in Figure 7(b). This move reduces C reg to 0 and C NSE to 0. Because P cost new (0) < P cost init (2.24), this move is accepted.
(2) Move {I} from E0 to E1, as shown in Figure 7(c). Because this move does not change C reg and C NSE, it is also accepted.
In this example, the partition in Figure 7 (c) has a minimum P cost of 0. Therefore, the final partition is E0 = {B, C, D} and E1 = {F, G, I }. The refinement step first constructs Pred set and Succ set, which requires O(|ES| + |E|) steps. During clique swapping, two loops are used. The first loop requires |E0| iterations. During the loop, a clique is moved to E1 and a new cost is calculated. These actions require an update of NSE count, Pred set, and Succ set with a complexity of O(|E nse | + |ES| + |E|). Thus, the complexity of the first loop is O(|E0| * (|E nse | + |ES| + |E|)). The second loop that moves cliques from E1 to E0 has similar complexity as O(|E1| * (|E nse | + |ES| + |E|)). Overall, the complexity of the refinement step is O(|ES|
Serialization
Serialization is applied after partitioning. The goal of this step is to select and apply serialization (E0 → E1), forming a maximal set of compatible serializations from E0 to E1.
As discussed in Section 3, compatible serializations represent a set of serializations that can be applied together without causing cycles that inhibit schedules. Since serialization (D → F ) (Figure 10(a) ) requires a serial edge (G to F ) and serialization (D → G) requires a serial edge (F to G), applying both serializations causes a cycle between F and G. Therefore, serializations (D → F ) and (D → G) are not compatible and they cannot be applied simultaneously. Formally, serialization for (u → v) cannot be applied if one of following conditions is true.
(1) |Pred(v)| = 0. This expression indicates that node v is an input node of the DDG. Because Lv(v) starts from the beginning of the DDG, v cannot be serialized after other nodes. Therefore, NSE(u, v) is inserted.
(2) |Succ(u)| = 0. This expression indicates that node u is an output node of the DDG. Because Lv(u) does not end in the DDG, no other nodes can be serialized after u. Therefore, NSE(u, v) is inserted.
(3) At least one node that is a Use(u) belongs to the set of nodes w ∈ V where ∃ a path (v, w).
If condition (3) is true, then (u → v) will create a cycle. A proof of this condition appears in Appendix in Lemma 1.
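The three conditions translate into a short reachability test. The sketch below is a simplified illustration that looks at data edges only (the text also evaluates control dependencies) and assumes that Use(u) equals the data successors of u. The edge list is inferred from the descriptions of Figures 6 and 10, and all function names are hypothetical.

```python
# Assumption: edge list inferred from the descriptions of Figures 6 and 10.
DDG = [("A", "B"), ("A", "C"), ("A", "D"),
       ("B", "E"), ("C", "E"), ("D", "E"),
       ("C", "F"), ("D", "F"), ("D", "G"),
       ("H", "I"), ("I", "J")]


def succ(edges, u):
    return {dst for (src, dst) in edges if src == u}


def pred(edges, v):
    return {src for (src, dst) in edges if dst == v}


def descendants(edges, v):
    """All nodes w reachable from v by a path of at least one edge."""
    seen, stack = set(), [v]
    while stack:
        for w in succ(edges, stack.pop()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def has_nse(edges, u, v):
    """True if serialization (u -> v) cannot be applied."""
    if not pred(edges, v):   # condition (1): v is an input node
        return True
    if not succ(edges, u):   # condition (2): u is an output node
        return True
    # condition (3): some Use(u) is reachable from v
    return bool(succ(edges, u) & descendants(edges, v))


print(has_nse(DDG, "B", "C"), has_nse(DDG, "B", "F"))
```

On this graph the check reports a bidirectional NSE between B and C, and no NSE from any of {B, C, D} to any of {F, G, I}.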
As discussed in Section 3, compatible serializations represent a set of serializations that can be applied together without causing cycles. Cycles caused by incompatible serializations can inhibit any possible schedule. In order to select a maximal set of compatible serializations from E0 to E1, a two-step serialization algorithm is used. The first step checks compatibility between serializations and creates a serialization interference graph (SIG). The second step selects and applies a maximal set of serializations based on the SIG.

4.3.1 SIG Construction. To represent serialization compatibility, this step creates a new graph called a serialization interference graph (SIG) based on the partitioning result (E0, E1). SIG(SV, IE) contains a set of serialization nodes (snodes) SV and a set of nondirected interference edges (iedges) IE. These terms are formally defined as follows.
-A serialization node represents a potential serialization (u → v) if there is no NSE(u, v). As shown in Figure 10(a), the partition, E0 = {B, C, D} and E1 = {F, G, I}, has no NSE(E0, E1). Therefore, SV in Figure 10(b) contains all nine potential serialization nodes from {B, C, D} to {F, G, I}. Formally, SV = {(u → v) : u ∈ E0, v ∈ E1, ∄ NSE(u, v)}.
-A serialization interference edge between two serialization nodes indicates that the nodes are incompatible and they cannot be applied together. A compatibility check evaluates all pairs of serialization nodes in a SIG. If two serializations (u → v) and (s → t) are incompatible, then there must be a cycle caused by serial edges (Use(u) to v) and (Use(s) to t), where Use(u) ∈ Succ(u) and Use(s) ∈ Succ(s). Such cycles can only exist if there is a path from t to Use(u) and another path from v to Use(s). A path from t to Use(u) indicates that either there is an NSE(u, t) or t is a Use(u). Similarly, a path from v to Use(s) indicates that either there is an NSE(s, v) or v is a Use(s). Therefore, serializations (u → v) and (s → t) are incompatible if and only if at least one of the following compatibility check conditions is true:
(1) v is a Use(s) and t is a Use(u);
(2) v is a Use(s) and there is an NSE(u, t);
(3) There is an NSE(s, v) and t is a Use(u);
(4) There is an NSE(s, v) and an NSE(u, t).
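The four conditions are the cross product of two simple tests, which makes the pairwise check compact. The sketch below builds the SIG node degrees for the partition of Figure 10(a). The Use sets are inferred from the serial edges listed in the text, the NSE set is empty because the example has no NSE(E0, E1), and the early return for two serializations with the same target is an added assumption (their serial edges all enter one node, so the pair alone cannot close a cycle). All names are hypothetical.

```python
from itertools import combinations

# Assumption: Use sets inferred from the text's description of Figure 10(a).
USES = {"B": {"E"}, "C": {"E", "F"}, "D": {"E", "F", "G"}}
NSE = set()  # no NSE(u, v) from E0 to E1 in this example
E0, E1 = ["B", "C", "D"], ["F", "G", "I"]


def blocks(s, v):
    """True if v is a Use(s) or an NSE(s, v) exists."""
    return v in USES.get(s, set()) or (s, v) in NSE


def incompatible(a, b):
    """Compatibility check conditions (1)-(4) for snodes a = (u, v), b = (s, t)."""
    (u, v), (s, t) = a, b
    if v == t:
        return False  # assumption: same-target serializations never interfere
    return blocks(s, v) and blocks(u, t)


snodes = [(u, v) for u in E0 for v in E1 if (u, v) not in NSE]
degree = {n: 0 for n in snodes}
for a, b in combinations(snodes, 2):
    if incompatible(a, b):
        degree[a] += 1
        degree[b] += 1

print(degree[("D", "F")], degree[("D", "G")], degree[("C", "G")])
```

Only (D → F), (D → G), and (C → G) end up with a nonzero degree; the other six serialization nodes have no interference edges.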
A proof of these conditions appears in Appendix in Lemma 2. Detailed steps used for SIG construction are presented in Figure 11.

4.3.2 Maximal Serializations. Since two compatible serialization nodes are not connected (independent) in a SIG, determining the maximal set of compatible serializations for a SIG is equivalent to finding the SIG maximum independent set. This maximum independent set problem has previously been shown to be NP-complete [Cormen et al. 1990]. To address this issue, we have developed a heuristic to find the maximal set of serializations. Our heuristic uses a serialization cost function S cost that includes two terms, N deg and Crit inc.
The N deg term is the SIG node degree, the number of serialization interference edges connected to the node. As shown in Figure 10(b), serialization node {D → F} has a node degree of 2, which indicates that it is not compatible with two other serializations. Crit inc represents the nonnegative critical path increase caused by serial edges. A serial edge from a Use(u) to v increases the critical path by:

Max((Etime(Use(u)) − Ltime(v) + 1), 0), (6)

where Etime/Ltime represents the earliest/latest time a node can be scheduled without increasing the DDG critical path. The earliest time (Etime) of an operation is calculated by a forward graph traversal using the following equation:

Etime(v) = Max(Etime(w) + Delay(w)), w ∈ Pred(v), (7)
where w are variables required to calculate v. Delay(w) represents the delay of operation w. For demonstration purposes, all operations in example DDGs have the same delay of one clock cycle. As shown in Figure 10 (a), Etime of E depends on Etime and Delay of three operations, B, C, and D because {B, C, D} are required to calculate E. Similarly, the latest time (Ltime) of an operation is calculated by a backward graph traversal using the following equation: Etime and Ltime values can be determined using a graph slack analysis algorithm [Marquardt et al. 2000] , which requires a forward and backward graph traversal of the DDG. For a serialization node (u → v) requiring multiple serial edges, the critical path increase is decided as:
As shown in Figure 10(a) , the critical path of the original DDG is A-B-E. Based on Eqs (7) and (8), A and H have (Etime, Ltime) of (1, 1) . B, C, D, and I have (Etime, Ltime) of (2, 2). E, F , G, and J have (Etime, Ltime) of (3, 3).
For a serialization node (B → F ) shown in Figure 10( b) , only one serial edge (E to F ) is required. Based on Eq. (6), this serial edge increases the critical path by Max((Etime(E) − Ltime(F ) + 1), 0) = 1. The DDG critical path is increased from A-B-E to A-B-E-F. Based on Eq. (9), Crit inc for serialization node (B → F ) is equal to 1. Similarly, other serialization nodes in Figure 10 (b) can be calculated using previously mentioned equations. In this example, all serialization nodes have the same Crit inc of 1.
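The Etime and Ltime values quoted above can be reproduced with one forward and one backward pass over a topological order. The sketch below assumes a uniform delay, a schedule that starts at cycle 1, sink nodes whose Ltime equals the largest Etime, and Ltime(v) computed as the minimum over successors w of Ltime(w) − Delay(v). These choices match the values in the text but are assumptions, as are the edge list (inferred from the description of Figure 10(a)) and the function names.

```python
from graphlib import TopologicalSorter

# Assumption: edge list inferred from the description of Figure 10(a).
DDG = [("A", "B"), ("A", "C"), ("A", "D"),
       ("B", "E"), ("C", "E"), ("D", "E"),
       ("C", "F"), ("D", "F"), ("D", "G"),
       ("H", "I"), ("I", "J")]


def slack_times(edges, delay=1):
    """Return (etime, ltime) dictionaries for a DDG with uniform delay."""
    preds, succs = {}, {}
    for u, v in edges:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
        succs.setdefault(u, set()).add(v)
        succs.setdefault(v, set())
    order = list(TopologicalSorter(preds).static_order())
    etime = {}
    for v in order:  # forward traversal
        etime[v] = max((etime[w] + delay for w in preds[v]), default=1)
    last = max(etime.values())
    ltime = {}
    for v in reversed(order):  # backward traversal
        ltime[v] = min((ltime[w] - delay for w in succs[v]), default=last)
    return etime, ltime


def crit_inc_edge(etime, ltime, w, v):
    """Critical path increase caused by serial edge (w to v)."""
    return max(etime[w] - ltime[v] + 1, 0)


etime, ltime = slack_times(DDG)
print(etime["E"], ltime["F"], crit_inc_edge(etime, ltime, "E", "F"))
```

For serial edge (E to F) the increase is 3 − 3 + 1 = 1, in agreement with the worked example.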
To maximize the total number of compatible serializations, the scalar factor γ is set to 1,024 so that the serialization node selection is first biased towards N deg values of 0, followed by N deg values of 1 with minimal Crit inc values. To control the critical path increase caused by serial edges, a threshold is set to prevent certain serializations. In our experiments, if the Crit inc of a serialization node is larger than three times the delay of a memory access operation, it is not applied, regardless of N deg.
The SIG in Figure 10 (c) illustrates that our heuristic continuously selects the serialization node with the smallest S cost until there are no more compatible serialization nodes left in the SIG. When a serialization node (u → v) is selected, serial edges (Use(u) to v) are inserted into the DDG. A detailed outline of this step is presented in Figure 12 .
For the example in Figure 10 (c), all serialization nodes have the same Crit inc of 1. The serialization node with the minimum N deg is selected and applied first. The serialization process is shown in the following text.
1. Select six serialization nodes with N deg of 0.
- {B → F} requires a serial edge (E to F).
- {B → I} requires a serial edge (E to I).
- {B → G} requires a serial edge (E to G).
- {C → F} requires a serial edge (E to F).
- {C → I} requires 2 serial edges, (E to I) and (F to I).
- {D → I} requires 3 serial edges, (E to I), (F to I), and (G to I).
2. Select two serialization nodes with N deg of 1.
- {D → G} requires 2 serial edges, (E to G) and (F to G).
- {C → G} requires 2 serial edges, (E to G) and (F to G).
After Step 2, there is no compatible serialization node left in the SIG, since {D → F} interferes with both {D → G} and {C → G}. As shown in Figure 10(d), after applying the eight serializations, the augmented DDG contains six serial edges. Max reg of the augmented DDG is reduced from 6 to 4 with a critical path increase of 4. The new excessive set is {B, C, D, I}, and the new critical path is A-B-E-F-G-I-J.
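The selection process in this example can be replayed with a short Python sketch. Several parts are assumptions made for illustration only: the Use sets are inferred from the serial edges listed above, the interference test is simplified to "one serialization needs an edge w to v while the other needs v to w", and S cost is taken to be γ × N deg + Crit inc, which matches the described bias but is not quoted from the cost definition. All identifiers are hypothetical.

```python
from itertools import product

GAMMA = 1024      # scalar factor from the text
MEM_DELAY = 2     # memory access delay in cycles (assumed for the threshold)

# Use sets inferred from the serial edges listed in the example.
USE = {"B": {"E"}, "C": {"E", "F"}, "D": {"E", "F", "G"}}
E0, E1 = ["B", "C", "D"], ["F", "G", "I"]


def serial_edges(u, v):
    """Serial edges (Use(u) to v) needed to place v after the last use of u."""
    return {(w, v) for w in USE[u] if w != v}


def interferes(a, b):
    """Simplified test: a needs an edge x->y while b needs the reverse y->x."""
    return any((y, x) in serial_edges(*b) for x, y in serial_edges(*a))


def select_serializations(snodes, crit_inc, max_crit_inc=3 * MEM_DELAY):
    """Greedily pick compatible serialization nodes in order of S cost."""
    n_deg = {s: sum(interferes(s, t) for t in snodes if t != s) for s in snodes}
    s_cost = {s: GAMMA * n_deg[s] + crit_inc[s] for s in snodes}
    chosen, rejected = [], set()
    for s in sorted(snodes, key=lambda node: s_cost[node]):
        if s in rejected or crit_inc[s] > max_crit_inc:
            continue  # incompatible with an earlier choice, or over threshold
        chosen.append(s)
        rejected.update(t for t in snodes if interferes(s, t))
    return chosen, n_deg


SNODES = list(product(E0, E1))
CRIT_INC = {s: 1 for s in SNODES}  # all nodes have Crit inc of 1 in the example
CHOSEN, N_DEG = select_serializations(SNODES, CRIT_INC)
SERIAL = set().union(*(serial_edges(*s) for s in CHOSEN))
print(len(CHOSEN), len(SERIAL))
```

Because N deg is multiplied by a factor far larger than any realistic Crit inc, the sort effectively orders nodes by N deg first and uses Crit inc only as a tie-breaker. A full implementation would test compatibility by checking for cycles in the augmented DDG rather than by the pairwise edge reversal used here.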
A limitation of the Tetris heuristic is that it only focuses on reducing the maximum register requirement. As a consequence, Tetris may cause a significant increase in the DDG critical path, which may limit the final performance. In the previous example, a register reduction of 2 is achieved at the cost of a large critical path increase of 4. To address this limitation, we present an enhanced heuristic, Tetris-XL, in Section 5. Details of serialization selection are included in Figure 12 . In general, Tetris does not guarantee a reduction in Max reg. For example, a collection of node pairs in an initial DDG, with each pair connected by a single directed edge, cannot be serialized if the edge sources are graph inputs and the sinks are graph outputs.
Serialization: Complexity Analysis.
In the SIG construction step, serialization node generation requires O(|E nse| + |E0|*|E1|) steps. The interference edge generation evaluates O(|E0|^2 * |E1|^2) snode pairs. For each snode pair, the compatibility check takes O(1). Therefore, the complexity of the SIG construction step is O(|E0|^2 * |E1|^2) = O(|V|^4), although in most cases the size of the excessive set is much smaller than the total set of DDG nodes.
The S cost calculation step first applies a topological ordering of G, which takes O(|V| + |E|) steps. The Etime calculation applies a forward traversal of G, which takes O(|V| + |E|) steps. Similarly, the Ltime calculation applies a backward traversal of G, which also takes O(|V| + |E|) steps. The S cost calculation takes O(|SV| + |IE|) = O(|E0|*|E1| + |E0|^2 * |E1|^2) steps. Therefore, the complexity of the S cost calculation step is O(|E0|^2 * |E1|^2) = O(|V|^4). The serialization selection step first applies a sort, which requires O(|SV| ln |SV|) steps. Then, the selection step takes O(|SV| + |IE|) = O(|E0|*|E1| + |E0|^2 * |E1|^2) steps. Therefore, the complexity of the serialization selection step is O(|E0|^2 * |E1|^2) = O(|V|^4).
Serializations: Comparison to URSA.
Although there are significant differences, URSA provides a serialization approach that is somewhat similar to Tetris. Like Tetris, URSA first determines the excessive set of a code block. To reduce the excessive set, URSA moves an operation to a separate resource hole. If an operation is the last operation in the live range of several variables, this action may free up several registers. A limitation of the approach is that it only considers moving an individual operation at each step. It is observed that in many situations, it could be more beneficial to apply serializations in groups rather than as individual operations [Berson et al. 1998 ]. Compared with URSA, Tetris tries to apply a group of serializations by moving a partition of nodes in the excessive set to a position after other nodes. This approach may be especially beneficial when two parts of the excessive set are relatively independent. 
DELAY REDUCTION VIA TETRIS-XL
The main difference between Tetris and Tetris-XL is shown in Figure 13 . As discussed in Section 4, the Tetris partitioning heuristic includes the following steps. The first partitioning step is coarsening, which generates the initial partitions (E0, E1) while minimizing the number of NSEs from E0 to E1. The second partitioning step is refinement, which moves variables between partitions E0 and E1 to reduce the maximum register requirement. Because Tetris refinement only considers register reduction, subsequent serialization may significantly increase the DDG critical path.
To address this limitation, Tetris-XL applies an alternate refinement step, as shown in Figure 13 . The Tetris-XL refinement evaluates both the maximum register requirement and the critical path increase. The coarsening and serialization steps are the same for both Tetris and Tetris-XL.
To demonstrate the basic idea of Tetris-XL, a simple example is presented in Figure 14 . The example DDG in Figure 14 (a) contains seven operations. Assume all operations have unit delay so that the original critical path is three (e.g., A-B-D). As shown in Figure 14 (b) , when operations D and I are scheduled last, variables {B, C, H, F } are alive simultaneously and the maximum register requirement is four. If only three physical registers are available, variables {B, C, H, F } form an excessive set, which requires a reduction step.
As shown in Figure 14(c), Tetris bipartitions the excessive set into E0 = {B, C} and E1 = {F, H}. Once {F, H} is serialized after {B, C}, the maximum register requirement is reduced from four to three. For the example shown in Figure 14(c), the augmented DDG has a maximum register requirement of three {D, F, H} and a critical path of six (A-B-D-F-H-I). Because Tetris-XL considers both register reduction and critical path control, it removes variables from serializations if they increase the critical path or do not contribute to register reduction. These removed variables form a separate partition E2, which stays unchanged after serialization. For example, Figure 14(d) shows that when variable H is removed from serialization, the critical path is reduced while the register reduction is maintained. If only {F} is serialized after {B, C}, the augmented DDG requires the same maximum register count of three {D, F, H} and exhibits a smaller critical path of four (A-B-D-F), leading to better performance.
In the next section, the partition cost function used in the Tetris-XL refinement is discussed. The cost function is then applied to the same refinement example used to describe Tetris.
Tetris-XL Partition Cost
In Tetris refinement, variables are moved between partitions E0 and E1 to reduce the partition cost, P cost, which represents the maximum register requirement. Tetris-XL moves variables between E0 and E1 based on a new cost function, P cost XL, which takes both the critical path increase and the maximum register requirement into account. If a variable in E0 or E1 does not contribute to register reduction, Tetris-XL moves it to a separate partition, E2. The same action is performed if a variable in E0 or E1 causes a large critical path increase. The Tetris-XL cost function, P cost XL, is defined as:

P cost XL = C reg XL + C NSE + C crit XL. (10)
Compared with the Tetris cost function, P cost, shown in Eq. (1), P cost XL adds a new C crit XL term that evaluates the effects of critical path increase. To consider the effect of the new partition E2 on the maximum register requirement, Tetris-XL also applies a C reg XL term to replace the C reg term in the previous Tetris cost function. The original C NSE term used in Tetris is kept unchanged in Tetris-XL. Figure 15 shows the calculation of P cost XL for two partitions with C NSE = 0. Both partitions are taken from the example in Figure 14. The C reg XL term represents the nonnegative gap between the maximum register requirement, Max reg XL, and the available physical registers, Phy reg:

C reg XL = Max((Max reg XL − Phy reg), 0). (11)
In Tetris-XL, variables in partition E2 can be alive simultaneously with either E0 or E1. Therefore, the Max reg XL term is calculated based on the topology of E0, E1, and E2:

Max reg XL = Max((|E0| + |Pred set(E1)|), (|E1| + |Succ set(E0)|)) + |E2|. (12)

As shown in Figure 15(b), the example partition has the following topology: |E0| = 2, |Succ set(E0)| = 1, |E1| = 1, |Pred set(E1)| = 0, and |E2| = 1. Based on Eq. (12), Max reg XL = Max(2, 2) + 1 = 3. Similarly, the partition in Figure 15(a) has Max reg XL = Max(3, 3) + 0 = 3. Because Phy reg = 3, both partitions have the same C reg XL = 3 − 3 = 0.
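As a quick check of the arithmetic, Eq. (12) and the nonnegative register gap can be written as two small Python functions. The function and parameter names are illustrative only, and each argument is a set size rather than the set itself.

```python
def max_reg_xl(e0, succ_e0, e1, pred_e1, e2):
    """Eq. (12): register requirement of a three-way partition (set sizes)."""
    return max(e0 + pred_e1, e1 + succ_e0) + e2


def c_reg_xl(max_reg, phy_reg):
    """Nonnegative gap between required and available physical registers."""
    return max(max_reg - phy_reg, 0)


# Partition of Figure 15(b): |E0| = 2, |Succ set(E0)| = 1,
# |E1| = 1, |Pred set(E1)| = 0, |E2| = 1, with Phy reg = 3.
fig_15b = max_reg_xl(e0=2, succ_e0=1, e1=1, pred_e1=0, e2=1)
print(fig_15b, c_reg_xl(fig_15b, phy_reg=3))
```

The |E2| term is added outside the Max because variables in E2 may be alive alongside either E0 or E1, so they count against the register budget in both cases.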
The second new term, C crit XL, estimates the DDG critical path increase caused by all serializations and evaluates its adverse effect on final performance:

C crit XL = ρ × Tot crit inc. (13)
In Eq. (13), Tot crit inc represents the critical path increase in terms of clock cycles. As shown in Figure 15(b), the partition requires a serial edge (D to F), which increases the critical path from A-B-D to A-B-D-F. Based on the assumption that all operations in the DDG have one clock cycle delay, the critical path is increased by one clock cycle. If a DDG contains operations with various delays, Eq. (6) can be applied to calculate the critical path increase. When multiple serial edges are required, the critical path increase of consecutive serial edges is accumulated. Tot crit inc is decided by the maximum accumulated critical path increase. As shown in Figure 15( …

To decide whether a serialization can improve performance, Tetris-XL considers both the benefit of register reduction and the cost of critical path increase. For example, the serialization in Figure 15(a) reduces the maximum register requirement by one but increases the critical path by three clock cycles. The register reduction of one saves one spill. As a consequence, it eliminates at least one memory store and one memory load operation. Assuming that store and load operations have the same delay, Mem delay = 2 clock cycles, the register reduction of one may reduce the critical path by 2 × Mem delay = 4 clock cycles. Because the benefit of register reduction (four clock cycles) outweighs the cost of the critical path increase (three clock cycles), the serialization is likely to improve performance.
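The trade-off argument above reduces to a comparison of two cycle counts. The following Python sketch restates it under the same simplifying assumption made in the text, namely that each register saved removes exactly one store and one load of equal delay. The function names are hypothetical.

```python
def spill_cycles_saved(reg_reduction, mem_delay):
    """Cycles saved if each freed register removes one store and one load."""
    return reg_reduction * 2 * mem_delay


def serialization_pays_off(reg_reduction, tot_crit_inc, mem_delay):
    """True when the estimated spill savings exceed the critical path cost."""
    return spill_cycles_saved(reg_reduction, mem_delay) > tot_crit_inc


# Figure 15(a): one register saved, three cycles added, Mem delay = 2.
print(spill_cycles_saved(1, 2), serialization_pays_off(1, 3, 2))
```

This is only an estimate: a saved register may remove more than one store and load pair, and spill code does not always lie on the critical path.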
To make a trade-off between register reduction and critical path increase, Eq. (13) applies a constant scalar factor ρ = 1/(2 × Mem delay). For the partition in Figure 15 
Tetris-XL Refinement
Tetris-XL uses the cost metrics shown in Eqs. (10)-(13) and the algorithm shown in Figure 9 to reduce the partition cost, P cost XL. In general, Tetris-XL uncoarsens the initial partitions and randomly moves NSE cliques between E0, E1, and E2. If the partition cost after a move, P cost XL new, is no larger than the initial partition cost, P cost XL init, then the move is accepted. The partition snapshot with the smallest P cost XL is recorded and chosen as the final partition.

Figure 16 shows the Tetris-XL refinement applied to the example illustrated in Figure 7. Because Tetris-XL applies the same coarsening step as Tetris, the same initial partition is generated, as shown in Figure 16(a). In the initial partition, the maximum register requirement is six so that C reg XL = 6 − 4 = 2. Tetris-XL uses the C NSE term defined in Eq. (4), so the value remains 0.24. Serializing {B, C, D, F, G} after {I} requires five serial edges from J, which causes a Tot crit inc of two (from A-B-E to H-I-J-B-E). Assuming the memory access delay is two clock cycles, C crit XL = 1/(2 × 2) × 2 = 0.5. Therefore, the initial partition cost P cost XL init = 2 + 0.24 + 0.5 = 2.74. The Tetris-XL refinement of this initial partition is described in the following text.
(1) Move {B, C, D} from E1 to E0, as shown in Figure 16(b). After this move, the maximum register requirement is reduced to 4 so that C reg XL = 4 − 4 = 0. C NSE is reduced to 0 because there is no NSE from E0 to E1. Serializing {F, G} after {B, C, D, I} increases the critical path by 2 (from A-B-E to A-B-E-F-G) and C crit XL = 1/(2 × 2) × 2 = 0.5. Because P cost XL new (0.5) ≤ P cost XL init (2.74), this move is accepted.
(2) Move {I} from E0 to E1, as shown in Figure 16(c). After this move, C reg XL remains 0 but the critical path is increased by 4 (from A-B-E to A-B-E-F-G-I-J) and C crit XL = 1/(2 × 2) × 4 = 1. Therefore, P cost XL new is increased to 1. Because P cost XL new (1) is still smaller than P cost XL init (2.74), this move is also accepted.
(3) Move {I} from E1 to E2, as shown in Figure 16(d). After this move, C reg XL stays unchanged while the critical path increase drops to 2 (from A-B-E to A-B-E-F-G). P cost XL new is reduced from 1 to 0.5 and the move is accepted.
In this example, the partition in Figure 16(d) is recorded as the final best partition with the smallest P cost XL of 0.5. The final partition is E0 = {B, C, D}, E1 = {F, G}, and E2 = {I}.
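The cost bookkeeping of this walk-through can be replayed in Python. The sketch below does not implement the randomized refinement itself. It only recomputes P cost XL for the four snapshots of Figure 16 from the term values quoted in the text, applies the acceptance test against the initial cost, and records the cheapest snapshot. Resolving the tie between Figures 16(b) and 16(d) in favor of the later snapshot is an assumption chosen to match the outcome reported above, and all identifiers are hypothetical.

```python
MEM_DELAY = 2
RHO = 1 / (2 * MEM_DELAY)  # scalar factor applied to Tot crit inc


def p_cost_xl(c_reg_xl, c_nse, tot_crit_inc):
    """Partition cost: register gap + NSE term + scaled critical path increase."""
    return c_reg_xl + c_nse + RHO * tot_crit_inc


# (label, C reg XL, C NSE, Tot crit inc) as quoted in the walk-through.
SNAPSHOTS = [
    ("16(a)", 2, 0.24, 2),  # initial partition
    ("16(b)", 0, 0.0, 2),   # {B, C, D} moved from E1 to E0
    ("16(c)", 0, 0.0, 4),   # {I} moved from E0 to E1
    ("16(d)", 0, 0.0, 2),   # {I} moved from E1 to E2
]


def replay(snapshots):
    """Return (initial cost, accepted labels, best label, best cost)."""
    first_label, *first_terms = snapshots[0]
    init = p_cost_xl(*first_terms)
    best_label, best_cost = first_label, init
    accepted = []
    for label, *terms in snapshots[1:]:
        cost = p_cost_xl(*terms)
        if cost <= init:               # acceptance test against the initial cost
            accepted.append(label)
            if cost <= best_cost:      # assumed tie-break: keep the later snapshot
                best_label, best_cost = label, cost
    return init, accepted, best_label, best_cost


INIT, ACCEPTED, BEST, BEST_COST = replay(SNAPSHOTS)
print(round(INIT, 2), ACCEPTED, BEST, BEST_COST)
```

Note that the move to Figure 16(c) raises the cost from 0.5 to 1 and is still accepted, because acceptance is judged against the initial cost rather than against the previous snapshot.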
Tetris-XL Serialization
Tetris-XL applies the same serialization step as Tetris. First, a serialization interference graph (SIG) is generated, as shown in Figure 17(b). The SIG includes six serialization nodes with the same Crit inc of one. The serialization node with the minimum number of interference edges, N deg, is selected and applied first. The serialization process proceeds as follows.
1. Select three serialization nodes with N deg of 0.
- {B → F} requires a serial edge (E to F).
- {B → G} requires a serial edge (E to G).
- {C → F} requires a serial edge (E to F).
2. Select two serialization nodes with N deg of 1.
- {D → G} requires two serial edges, (E to G) and (F to G).
- {C → G} requires two serial edges, (E to G) and (F to G).
After Step 2, there is no compatible serialization node left in the SIG since {D → F} interferes with both {D → G} and {C → G}. As shown in Figure 17(d), after applying the five compatible serializations, the augmented DDG contains three serial edges. The maximum register requirement of the augmented DDG is reduced from six to four with a critical path increase of 2. The new excessive set is {B, C, D, I} and the new critical path is A-B-E-F-G. Compared with the Tetris result shown in Figure 10(d), Tetris-XL achieves the same register reduction of two with a smaller critical path increase of two, resulting in better performance.
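The critical path comparison between the two heuristics can be verified with a longest-path computation over the augmented DDG. In the Python sketch below, the base edge list is a hypothetical reconstruction of the example DDG inferred from the text, the Tetris serial edges are the six listed for Figure 10(d), and the three Tetris-XL serial edges, (E to F), (E to G), and (F to G), are inferred from the five compatible serializations. Unit delays are assumed throughout.

```python
from functools import lru_cache

# Hypothetical reconstruction of the example DDG (unit-delay operations).
DDG = [("A", "B"), ("A", "C"), ("A", "D"),
       ("B", "E"), ("C", "E"), ("D", "E"),
       ("C", "F"), ("D", "F"), ("D", "G"),
       ("H", "I"), ("I", "J")]
TETRIS_SERIAL = [("E", "F"), ("E", "G"), ("E", "I"),
                 ("F", "G"), ("F", "I"), ("G", "I")]
TETRIS_XL_SERIAL = [("E", "F"), ("E", "G"), ("F", "G")]


def critical_path_length(edges):
    """Length of the longest path, counted in unit-delay operations."""
    succs = {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)
        succs.setdefault(v, [])

    @lru_cache(maxsize=None)
    def longest_from(node):
        # One cycle for this node plus the longest chain that follows it.
        return 1 + max((longest_from(s) for s in succs[node]), default=0)

    return max(longest_from(n) for n in succs)


BASE = critical_path_length(DDG)
TETRIS = critical_path_length(DDG + TETRIS_SERIAL)
TETRIS_XL = critical_path_length(DDG + TETRIS_XL_SERIAL)
print(BASE, TETRIS - BASE, TETRIS_XL - BASE)
```

Moving I into E2 removes every serial edge that ends at I, which is what shortens the augmented critical path from A-B-E-F-G-I-J to A-B-E-F-G.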
EXPERIMENTAL APPROACH AND RESULTS
To evaluate the effectiveness of Tetris and Tetris-XL reduction algorithms, a direct comparison to previous spill reduction techniques was performed. These techniques were implemented in an academic VLIW compiler, Trimaran, version 2.0 [Chakrapani et al. 2004] . Trimaran allows users to modify the number of target functional units (FUs), registers, and other resources to allow for examination of a broad range of VLIW architectures. Benchmarks in our experiments include a set of programs taken from the Trimaran framework [Chakrapani et al. 2004] and three applications taken from the MediaBench suite [Lee et al. 1997] . Benchmarks unepic, g721dec, and mpeg2dec are applications for image, audio, and video signal processing, respectively.
As shown in Figure 18 , our experiments include six flows. Flow 1 is the baseline Trimaran flow, which includes list scheduling followed by register allocation using graph coloring. To control register pressure, Flows 2, 3, and 4 apply Tetris, Tetris-XL, and greedy serialization, respectively, before scheduling. Greedy serialization [Touati 2005 ] was discussed at the end of Section 3. Flow 5 applies another spill reduction technique, MRIS [Govindarajan et al. 2003 ], before the default graph coloring register allocator in Trimaran. MRIS includes lineage generation/fusion and list scheduling, as discussed in Section 2. MRIS was previously used [Govindarajan et al. 2003 ] to perform register allocation before instruction scheduling on out-of-order issue superscalar processors. We consider MRIS to be a good candidate for comparison to Tetris for VLIW processors, since it has previously demonstrated a strong ability to perform schedule-sensitive register reductions. To limit the impact of false dependencies introduced by MRIS on performance, register reduction is only performed if the register requirement exceeds the number of available registers. Tetris and Tetris-XL also follow this restriction. In Flow 6, an enhanced register allocator with FBS is applied after the default list scheduling in Trimaran. FBS is a frequency-based live-range splitting technique [Kim 2001 ] described in Section 2.
All flows include dead code elimination, constant propagation, and loop unrolling prior to the steps shown in Figure 18 . The results reported by Trimaran do not consider dynamic effects such as interrupts or cache interactions. Except where noted, the commercial architectures modeled by our experiments do not include caches, limiting the impact of this issue on the target architecture. Benchmark cycle counts are determined by Trimaran following compilation by assessing the schedule length of each code block, including cycles to handle spills, multiplied by the number of times each code block is called. The reported execution cycle counts are static cycle counts. In the experiments, multiply and memory load operations require two clock cycles. Other functional unit operations require one clock cycle.
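The static cycle count described above is a weighted sum over code blocks. A minimal Python sketch follows. The `CodeBlock` record and the sample numbers are hypothetical and do not correspond to Trimaran data structures or to any benchmark in this article.

```python
from dataclasses import dataclass


@dataclass
class CodeBlock:
    name: str
    schedule_length: int  # cycles per execution, including spill handling
    exec_count: int       # number of times the block is executed


def static_cycles(blocks):
    """Sum of schedule length times execution count over all code blocks."""
    return sum(b.schedule_length * b.exec_count for b in blocks)


# Hypothetical profile: a short prologue and a hot loop body.
BLOCKS = [CodeBlock("prologue", 5, 1), CodeBlock("loop_body", 12, 1000)]
print(static_cycles(BLOCKS))
```

Because the count is static, a change in the schedule length of a frequently executed block dominates the total, which is why spills inside hot loops have a large effect on the reported cycles.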
Experiments with High Register Pressure
The first VLIW architecture evaluated in our experiments is a 4-way VLIW architecture with 16 registers, which can execute 4 operations (including 2 memory operations) on every clock cycle. This resource configuration can be found in several low-end commercial VLIW processors including the Freescale MSC8101 and MSC8103 [Freescale Semiconductor, Inc. 2005]. These processors are often used in resource-constrained embedded systems. Our experiment evaluates the benefit of each individual technique for the 4-way architecture.

Table I compares spills, the total number of spill operations executed in each benchmark, as reported by Trimaran. A spill operation is a memory store or load operation inserted by the register allocator. Spills are shown in thousands. For the baseline flow, the spill ratio is also presented as the percentage of spills to ops, the total number of operations (including spill operations) executed in each benchmark. A high spill ratio indicates that a benchmark suffers from high register pressure. At the bottom of the table, the geometric average (GEOMEAN) of spills is provided along with the geometric average of the per-benchmark percent changes versus the baseline Trimaran flow. The geometric average is used for averaging the spill counts due to the presence of widely varying absolute spill values.

Table II compares cycles, the total number of clock cycles required to execute each benchmark. Multiple operations are executed on each clock cycle. Cycles are also shown in thousands. At the bottom of the table, the geometric average of cycles is provided along with the geometric average of the per-benchmark percent changes versus the baseline Trimaran flow. The geometric average is used for averaging the cycle counts due to the presence of widely varying absolute cycle values. For the baseline (Flow 1) in Table I, on average, spills take up 46% of total executed operations, which indicates that most benchmarks experience high register pressure.
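The summary metrics used in the tables can be illustrated with a short Python sketch. The sample values are hypothetical and are not taken from Tables I or II. The article does not spell out how the per-benchmark percent changes are averaged, so the sketch assumes that the geometric mean is taken over the per-benchmark ratios (flow divided by baseline) and then converted back to a percent change.

```python
from statistics import geometric_mean


def spill_ratio(spills, ops):
    """Spill operations as a percentage of all executed operations."""
    return 100.0 * spills / ops


def geomean_percent_change(baseline, flow):
    """Geometric mean of per-benchmark ratios, expressed as a percent change."""
    ratios = [f / b for f, b in zip(flow, baseline)]
    return (geometric_mean(ratios) - 1.0) * 100.0


# Hypothetical spill counts (in thousands) for two benchmarks.
BASELINE = [100, 400]
REDUCED = [50, 200]
print(spill_ratio(46, 100), geomean_percent_change(BASELINE, REDUCED))
```

A geometric mean keeps one benchmark with a very large absolute count from dominating the average, but it is undefined when a value is zero, so a benchmark whose spills are eliminated entirely would need separate handling.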
Among the five spill reduction techniques, Tetris-XL (Flow 2), Tetris (Flow 3), and MRIS (Flow 4) are the three most efficient techniques in terms of reducing spills and improving performance. Compared with the baseline, Tetris-XL (Flow 2), Tetris (Flow 3) and MRIS (Flow 4) reduce spills by 30%, 29%, and 27%, respectively. As for the performance improvement, Table II shows that, on average, Tetris-XL, Tetris, and MRIS reduce execution cycles by 30%, 25%, and 16%, respectively. The execution cycle reduction of a VLIW program is decided by two factors, spill reduction and critical path increase. A large spill reduction with a small critical path increase leads to a large reduction in execution cycles.
It is observed that Tetris-XL, Tetris, and MRIS reduce the maximum register requirement by reordering operations before instruction scheduling. Two additional experiments illustrate the benefit of Tetris-XL over other approaches. Because spills are closely related to the maximum register requirement, a first experiment evaluates the maximum register requirement of each benchmark after spill reduction techniques are applied. For example, the first bar in Figure 19 (a) shows that Tetris-XL reduces the maximum register requirement of benchmark bmm by 44% versus the Trimaran baseline flow. Tetris-XL is able to reduce spills of benchmark bmm by 33%, as shown by the first bar of Figure 19 (b) . The geometric average of results in Figure 19(a) shows that, on average, Tetris-XL, Tetris, and MRIS reduce the maximum register requirement by 34%, 27%, and 18%, respectively. These maximum register requirement reductions cause corresponding average spill reductions of 30%, 29%, and 27%, as shown in Figure 19 (b) .
The critical path increase caused by serialization plays an important role in deciding the final performance in terms of execution cycles. A second experiment compares the critical path of each benchmark after applying Tetris-XL, Tetris, and MRIS. As shown in Figure 19(c), Tetris-XL, Tetris, and MRIS increase the average critical path of benchmarks by 10%, 17%, and 29%, respectively, compared with the baseline flow. Overall, Tetris-XL achieved a 30% reduction in execution cycles, outperforming both Tetris (25%) and MRIS (16%). Therefore, the benefit of Tetris-XL is a result of its ability to limit critical path increases during register reduction. In contrast, Tetris and MRIS do not consider the potential adverse effects of critical path increases caused by serial edges. The results in Figure 19 indicate that Tetris and Tetris-XL are effective in improving performance for a range of applications, including multimedia benchmarks.
Applications with large datasets, such as scientific applications, which typically use a sizable number of registers, are likely to benefit more from our approach.

As currently structured, Tetris and Tetris-XL are optimized for application runtime reduction rather than reduced compile time. As noted in Section 4, both Tetris and Tetris-XL examine all possible serializations of variables in the excessive set. Multiple iterations of the algorithm are performed until no further Max reg minimization can be achieved. This evaluation can lead to a compile time increase for designs with large code blocks. As shown in Table III, the large code blocks in the MediaBench benchmarks g721dec, unepic, and mpeg2dec lead to significant compile time increases versus baseline Trimaran compilation. The average code block size (about 75 lines) for these three benchmarks is two to three times the size of the other benchmarks. For the three multimedia benchmarks, nearly half of this compile time increase is due to the measure step [Touati 2005; Berson et al. 1993] used to determine Max reg after each Tetris iteration, not the Tetris algorithm itself. This compile time issue may be alleviated somewhat by limiting the use of Tetris and Tetris-XL to final pass compilation after code testing and optimization have been completed. In an additional experiment, the three large designs were restructured to achieve the same functionality, but with code block sizes reduced by about a factor of three. The results in Table IV demonstrate that the compile time can be reduced with continued runtime performance improvement versus the baseline flow. For the unepic and mpeg2dec designs, the baseline cycle count after restructuring improved on the initial coding. Baseline cycle count is slightly worse after recoding for g721dec. Geomean values in Table IV reflect averages across all designs, including those that have not been restructured.
Experiments with Low Register Pressure
The second VLIW architecture evaluated in our experiments is an 8-way architecture with 32 registers. This architecture can execute eight operations (including two memory operations) on every clock cycle. This configuration is similar to the VLIW processors found in the C62x and C67x families offered by Texas Instruments [Texas Instruments, Inc. 2000] . For a register size of 32, the baseline (Flow 1) in Table V has a spill ratio of 15%. This value indicates a relatively low but nontrivial register pressure. Among the five spill reduction techniques, Tetris-XL (Flow 2), Tetris (Flow 3), and MRIS (Flow 4) are still the most efficient techniques in terms of spill reduction. Compared with the baseline, Tetris-XL (Flow 2), Tetris (Flow 3), and MRIS (Flow 4) reduce spills by 45%, 56%, and 34%, respectively. As for performance improvement, Table VI shows that, on average, Tetris-XL, Tetris, and MRIS reduce execution cycles by 19%, 15%, and 7%, respectively.
It is observed from Table V that Tetris achieves 11% more spill reduction than Tetris-XL, on average. Tetris-XL is more conservative when register pressure is low because it evaluates both spill reduction and potential critical path increases. The aggressive spill reduction of Tetris comes at the cost of large critical path increases. As shown in Figure 20(c), the average critical path increases caused by Tetris-XL, Tetris, and MRIS are 9%, 17%, and 28%, respectively. Therefore, although Tetris-XL achieves less spill reduction compared with Tetris, it still outperforms Tetris and other techniques in terms of performance improvement.
Benchmark wave demonstrates the benefit of Tetris-XL. As shown in Figure 20(b) and (c), Tetris reduces spills of wave by 94% at the cost of an 11% increase in critical path length. Tetris-XL makes a trade-off between spill reduction and critical path increase so that a smaller spill reduction of 89% is achieved with a smaller critical path increase of 5%. As a result of this trade-off, Tetris-XL achieves an additional execution cycle reduction of 4% compared with Tetris, which is shown in Figure 20(d).
Experiments with Trivial Register Pressure
An 8-way VLIW machine with 64 registers was used for a final experiment. This architecture has the same basic FU and register configuration as the Transmeta Efficeon VLIW processor [Transmeta, Inc. 2005] and the Texas Instruments C64x processor [Texas Instruments, Inc. 2000] . On average, spills only take up 2% of the total operations in the baseline flow in Table VII . Due to this trivial register pressure, Table VIII shows that Tetris-XL, Tetris, and MRIS provide an average performance improvement of 2%, 1%, and 1%, respectively. As expected, the benefit of register pressure control becomes marginal when register pressure is very low.
SUMMARY AND FUTURE WORK
In this article, we present a new spill reduction technique to improve the performance of VLIW processors with limited registers. By modifying the relative ordering of operations, this technique serializes multiple variables simultaneously so that the register requirement can be reduced with a limited critical path increase. For VLIW programs that experience high register pressure, this technique reduces spills and improves execution time by identifying operations that are likely to create spills. For a 4-way VLIW architecture with 16 registers, our heuristic reduces the average execution time by an additional 14% compared with previous spill reduction techniques by simultaneously considering numerous variable serializations. For an 8-way VLIW architecture with 32 registers, the additional execution time reduction is 10%. For architectures with 64 registers, little additional execution time reduction is achieved.

Several areas of future work are apparent from this research. The effect of cache sizes and protocols, as well as other dynamic microarchitectural features, could be considered in the context of high register pressure. Additionally, more trade-offs could be considered in Tetris to reduce compile time. Unlikely serializations could be pruned earlier in partitioning to reduce serialization search time. The number and type of functional units could be considered by an expanded version of Tetris-XL in an effort to avoid the optimization of infeasible schedules. Finally, a theoretical treatment of the conditions that lead to Max reg reduction could be explored.
